{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#### nie robimy 2 nowych linii w bloku funkcji. sentences[::2] oraz sentences[1::2] powinny być przypisane do osobnych zmiennych" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Pies ten pochodzi z południowych Chin, z terenów prowincji Guangdong! \n", "Został rozpropagowany i hodowany w celach wystawowych przez hodowców w USA. \n", "Nazwa psa, pochodząca z chińskiego, oznacza dosłownie piaszczysta skóra. \n", "['Pies ten pochodzi z południowych Chin, z terenów prowincji Guangdong! ', 'Został rozpropagowany i hodowany w celach wystawowych przez hodowców w USA. ', 'Nazwa psa, pochodząca z chińskiego, oznacza dosłownie piaszczysta skóra. ']\n" ] } ], "source": [ "import re\n", "tekst = \"Pies ten pochodzi z południowych Chin, z terenów prowincji Guangdong! Został rozpropagowany i hodowany w celach wystawowych przez hodowców w USA. Nazwa psa, pochodząca z chińskiego, oznacza dosłownie piaszczysta skóra. Chart polski polska rasa psa myśliwskiego, znana prawdopodobnie od czasów Galla Anonima, zaliczana do grupy chartów.\"\n", "def split_sentences(text):\n", " sentences = re.split(r'([.!?]\\s+)(?=[A-Z])', text)\n", "\n", "\n", " full_sentences = [''.join(pair) for pair in zip(sentences[::2], sentences[1::2])]\n", "\n", "\n", " for sentence in full_sentences:\n", " print(sentence)\n", " print(full_sentences)\n", "split_sentences(tekst)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Niewłaściwa nazwa funkcji switch_letter (robi coś innego, niż nazwa na to wskazuje). Linijka z sum jest nieczytelna." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--- faja.\n" ] } ], "source": [ "text = \"kurde faja.\"\n", "\n", "vulgar_words_base = [\"kurd\", \"choler\"]\n", "\n", "def switch_letter(word, vulgar_word_list):\n", " word = word.lower()\n", " for bad_word in vulgar_word_list:\n", " switched_letters = sum(1 for a, b in zip(word, bad_word) if a != b)\n", " if switched_letters == 1:\n", " return True\n", " return False\n", "\n", "def censor_text(text):\n", " pattern = re.compile(r'[^\\s]*(' + '|'.join([f'{word}' for word in vulgar_words_base]) + r')[^\\s]*', re.IGNORECASE)\n", " censored_text = pattern.sub(\"---\", text)\n", "\n", " censored_text_list = censored_text.split()\n", " \n", " for i, word in enumerate(censored_text_list):\n", " if switch_letter(word, vulgar_words_base):\n", " censored_text_list[i] = \"---\"\n", " final_censored_text = \" \".join(censored_text_list)\n", "\n", " return final_censored_text\n", "\n", "print(censor_text(text))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "switch_letter(\"kurcze\", [\"kurzce\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Jeżeli nie ma takiej konieczności nie iterujemy po rozdzielonym na słowa tekście, tylko na całym tekście." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Siała baba mak.\n", "Czy wiedziała jak?\n", "Dziadek wiedział, nie powiedział, a to było tak!\n" ] } ], "source": [ "# Solution 2\n", "text = 'Siała baba mak. Czy wiedziała jak? 
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Do not write if {variable}; write if {variable} is True/False. Move the code for a given condition to a new line." ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "def validate_name(name):\n", "    valid = re.match(r'^[A-Z][a-z]{1,}',name)\n", "    if valid: return True\n", "    else: return False\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### An example of a comment used properly" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def censor_text(text):\n", "    prefixes = r'(do|na|o|od|pod|po|prze|przy|roz|s|u|w|y|za|z|u)*'\n", "\n", "    # profanities according to prof. Jerzy Bralczyk\n", "    profanities = [\n", "        rf'\\b{prefixes}(pierd\\w*)\\b',\n", "    ]\n", "\n", "    profanity_pattern = re.compile('|'.join(profanities), re.IGNORECASE)\n", "\n", "    return profanity_pattern.sub('---', text)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# if this if distinguishes 3 variants on the same level, we do not nest the conditions\n", "if [ \"$positive_count\" -gt \"$negative_count\" ]; then\n", "    echo \"wydzwiek pozytywny\"\n", "else\n", "    if [ \"$negative_count\" -gt \"$positive_count\" ]; then\n", "        echo \"wydzwiek: negatywny\"\n", "    else\n", "        echo \"wydzwiek: neutralny\"\n", "    fi\n", "fi\n", "\n", "\n", "# this else will never run - it should not be here\n", "if [ $positive_count -gt $negative_count ]\n", "    then echo \"Positive\"\n", "elif [ $positive_count -lt $negative_count ]\n", "    then echo \"Negative\"\n", "elif [ $positive_count -eq $negative_count ]\n", "    then echo \"Neutral\"\n", "else\n", "    echo \"Error\" # this case cannot happen\n", "fi\n", "\n", "\n", "# positive - a typo like this stands out (even though the program still works)\n", "POZITIVE=\"positive-words.txt\"" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Notebook 05" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# algorithm taken from the pseudocode in professor Jassem's slides\n", "def maxmatch_text_split(text, vocabulary):\n", "    if text == \"\":\n", "        return []\n", "    for i in range(len(text)-1, -1, -1):\n", "        firstword = text[0:i+1] # we do not write [0:x], just [:x]\n", "        reminder = text[i+1:]\n", "        if firstword in vocabulary:\n", "            return [firstword] + maxmatch_text_split(reminder, vocabulary)\n", "    firstword = text[0]\n", "    reminder = text[1:]\n", "    return [firstword] + maxmatch_text_split(reminder, vocabulary)" ] },
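{ "cell_type": "markdown", "metadata": {}, "source": [ "A quick usage sketch for maxmatch_text_split defined above. The toy vocabulary and the input string are made-up assumptions for illustration; with them the call should return ['siała', 'baba', 'mak']." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative call; sample_vocabulary is an assumed toy vocabulary.\n", "sample_vocabulary = {\"siała\", \"baba\", \"mak\", \"ma\"}\n", "print(maxmatch_text_split(\"siałababamak\", sample_vocabulary))" ] },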
"import re\n", "\n", "\n", "def create_bpe_tokenizer(text, max_vocab_length):\n", " text = (\"\".join(x for x in text if x not in string.punctuation)).lower()\n", " vocabulary = list(set([x for x in text]))\n", " while len(vocabulary)