.. Copyright (C) 2001-2019 NLTK Project
.. For license information, see LICENSE.TXT

=============
 Classifiers
=============

Classifiers label tokens with category labels (or *class labels*).
Typically, labels are represented with strings (such as ``"health"``
or ``"sports"``).  In NLTK, classifiers are defined using classes that
implement the `ClassifierI` interface:

    >>> import nltk
    >>> nltk.usage(nltk.classify.ClassifierI)
    ClassifierI supports the following operations:
      - self.classify(featureset)
      - self.classify_many(featuresets)
      - self.labels()
      - self.prob_classify(featureset)
      - self.prob_classify_many(featuresets)

NLTK defines several classifier classes:

- `ConditionalExponentialClassifier`
- `DecisionTreeClassifier`
- `MaxentClassifier`
- `NaiveBayesClassifier`
- `WekaClassifier`

Classifiers are typically created by training them on a training
corpus.
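
To make the train-then-classify pattern concrete, here is a minimal sketch
in plain Python.  It does not import NLTK or subclass `ClassifierI`; the
class name ``MajorityLabelClassifier`` and the toy corpus are hypothetical,
and only the method names mirror the operations listed above.

```python
from collections import Counter


class MajorityLabelClassifier:
    """Hypothetical baseline: always predicts the most common training label."""

    def __init__(self, label_counts):
        self._label_counts = Counter(label_counts)

    @classmethod
    def train(cls, labeled_featuresets):
        # labeled_featuresets is a list of (featureset, label) pairs.
        return cls(Counter(label for _, label in labeled_featuresets))

    def labels(self):
        return sorted(self._label_counts)

    def classify(self, featureset):
        # The featureset is ignored: this baseline only looks at label counts.
        return self._label_counts.most_common(1)[0][0]

    def classify_many(self, featuresets):
        return [self.classify(featureset) for featureset in featuresets]


# Toy corpus (an assumption for illustration, not part of the NLTK tests).
toy_corpus = [
    ({"a": 1}, "sports"),
    ({"a": 0}, "health"),
    ({"a": 1}, "sports"),
]
baseline = MajorityLabelClassifier.train(toy_corpus)
print(baseline.labels())
print(baseline.classify_many([{"a": 0}, {"a": 1}]))
```

A real NLTK classifier would also look at the featureset; the sketch only
shows the shape of the interface.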


Regression Tests
~~~~~~~~~~~~~~~~

We define a very simple training corpus with 3 binary features: ['a',
'b', 'c'], and two labels: ['x', 'y'].  We use a simple feature set so
that the correct answers can be calculated analytically (although we
haven't done this yet for all tests).

    >>> train = [
    ...     (dict(a=1,b=1,c=1), 'y'),
    ...     (dict(a=1,b=1,c=1), 'x'),
    ...     (dict(a=1,b=1,c=0), 'y'),
    ...     (dict(a=0,b=1,c=1), 'x'),
    ...     (dict(a=0,b=1,c=1), 'y'),
    ...     (dict(a=0,b=0,c=1), 'y'),
    ...     (dict(a=0,b=1,c=0), 'x'),
    ...     (dict(a=0,b=0,c=0), 'x'),
    ...     (dict(a=0,b=1,c=1), 'y'),
    ...     ]
    >>> test = [
    ...     (dict(a=1,b=0,c=1)), # unseen
    ...     (dict(a=1,b=0,c=0)), # unseen
    ...     (dict(a=0,b=1,c=1)), # seen 3 times, labels=y,y,x
    ...     (dict(a=0,b=1,c=0)), # seen 1 time, label=x
    ...     ]

Test the Naive Bayes classifier:

    >>> classifier = nltk.classify.NaiveBayesClassifier.train(train)
    >>> sorted(classifier.labels())
    ['x', 'y']
    >>> classifier.classify_many(test)
    ['y', 'x', 'y', 'x']
    >>> for pdist in classifier.prob_classify_many(test):
    ...     print('%.4f %.4f' % (pdist.prob('x'), pdist.prob('y')))
    0.3203 0.6797
    0.5857 0.4143
    0.3792 0.6208
    0.6470 0.3530
    >>> classifier.show_most_informative_features()
    Most Informative Features
                           c = 0                   x : y      =      2.0 : 1.0
                           c = 1                   y : x      =      1.5 : 1.0
                           a = 1                   y : x      =      1.4 : 1.0
                           b = 0                   x : y      =      1.2 : 1.0
                           a = 0                   x : y      =      1.2 : 1.0
                           b = 1                   y : x      =      1.1 : 1.0
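
The Naive Bayes probabilities above can be checked by hand.  The sketch
below is plain Python, not NLTK's implementation.  It assumes
expected-likelihood smoothing (add 0.5 to every count), which is the
default estimator of ``NaiveBayesClassifier.train``.  The helper names
``ele`` and ``posterior`` are made up for this example.

```python
from collections import Counter, defaultdict

train = [
    (dict(a=1, b=1, c=1), 'y'),
    (dict(a=1, b=1, c=1), 'x'),
    (dict(a=1, b=1, c=0), 'y'),
    (dict(a=0, b=1, c=1), 'x'),
    (dict(a=0, b=1, c=1), 'y'),
    (dict(a=0, b=0, c=1), 'y'),
    (dict(a=0, b=1, c=0), 'x'),
    (dict(a=0, b=0, c=0), 'x'),
    (dict(a=0, b=1, c=1), 'y'),
]
test = [
    dict(a=1, b=0, c=1),
    dict(a=1, b=0, c=0),
    dict(a=0, b=1, c=1),
    dict(a=0, b=1, c=0),
]


def ele(count, total, bins):
    """Expected-likelihood estimate: add 0.5 to each of the `bins` outcomes."""
    return (count + 0.5) / (total + 0.5 * bins)


label_counts = Counter(label for _, label in train)
feature_counts = defaultdict(Counter)  # (label, fname) -> counts of values
feature_values = defaultdict(set)      # fname -> set of observed values
for featureset, label in train:
    for fname, fval in featureset.items():
        feature_counts[label, fname][fval] += 1
        feature_values[fname].add(fval)


def posterior(featureset):
    """Return {label: P(label | featureset)} under the naive Bayes assumption."""
    scores = {}
    for label, n_label in label_counts.items():
        score = ele(n_label, len(train), len(label_counts))  # prior P(label)
        for fname, fval in featureset.items():
            score *= ele(feature_counts[label, fname][fval],
                         n_label, len(feature_values[fname]))
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


rows = ['%.4f %.4f' % (p['x'], p['y']) for p in map(posterior, test)]
for row in rows:
    print(row)
```

For ``test[0]``, the unnormalized scores are 0.45 * 0.3 * 0.3 * 0.5 for
``'x'`` and 0.55 * (2.5/6) * (1.5/6) * (4.5/6) for ``'y'``, which normalize
to the 0.3203 and 0.6797 reported by the classifier.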

Test the Decision Tree classifier:

    >>> classifier = nltk.classify.DecisionTreeClassifier.train(
    ...     train, entropy_cutoff=0, support_cutoff=0)
    >>> sorted(classifier.labels())
    ['x', 'y']
    >>> print(classifier)
    c=0? .................................................. x
      a=0? ................................................ x
      a=1? ................................................ y
    c=1? .................................................. y
    <BLANKLINE>
    >>> classifier.classify_many(test)
    ['y', 'y', 'y', 'x']
    >>> for pdist in classifier.prob_classify_many(test):
    ...     print('%.4f %.4f' % (pdist.prob('x'), pdist.prob('y')))
    Traceback (most recent call last):
      . . .
    NotImplementedError
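
The printed tree can be read as a sequence of feature tests.  As an
illustration, the sketch below encodes the tree shown above by hand as
nested tuples and walks it for each test featureset.  The encoding and the
``walk`` helper are assumptions made for this example, not NLTK's internal
representation.

```python
# Hand-written encoding of the printed tree: a node is either a label
# string (a leaf) or a (feature name, {feature value: child node}) pair.
tree = ("c", {0: ("a", {0: "x", 1: "y"}),
              1: "y"})

test = [
    dict(a=1, b=0, c=1),
    dict(a=1, b=0, c=0),
    dict(a=0, b=1, c=1),
    dict(a=0, b=1, c=0),
]


def walk(node, featureset):
    """Follow the branch matching each tested feature until a leaf is reached."""
    while not isinstance(node, str):
        fname, branches = node
        node = branches[featureset[fname]]
    return node


predictions = [walk(tree, featureset) for featureset in test]
print(predictions)
```

The walk returns a single label per featureset and no probability
distribution, which is consistent with ``prob_classify_many`` raising
``NotImplementedError`` for this classifier.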

Test SklearnClassifier, which requires the scikit-learn package.

    >>> from nltk.classify import SklearnClassifier
    >>> from sklearn.naive_bayes import BernoulliNB
    >>> from sklearn.svm import SVC
    >>> train_data = [({"a": 4, "b": 1, "c": 0}, "ham"),
    ...               ({"a": 5, "b": 2, "c": 1}, "ham"),
    ...               ({"a": 0, "b": 3, "c": 4}, "spam"),
    ...               ({"a": 5, "b": 1, "c": 1}, "ham"),
    ...               ({"a": 1, "b": 4, "c": 3}, "spam")]
    >>> classif = SklearnClassifier(BernoulliNB()).train(train_data)
    >>> test_data = [{"a": 3, "b": 2, "c": 1},
    ...              {"a": 0, "b": 3, "c": 7}]
    >>> classif.classify_many(test_data)
    ['ham', 'spam']
    >>> classif = SklearnClassifier(SVC(), sparse=False).train(train_data)
    >>> classif.classify_many(test_data)
    ['ham', 'spam']
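
scikit-learn estimators work on numeric matrices rather than feature
dictionaries, so the dictionaries have to be turned into rows first.  NLTK's
wrapper relies on scikit-learn's ``DictVectorizer`` for that step.  The
sketch below only imitates the idea in plain Python for numeric features;
the names ``feature_names`` and ``to_row`` are hypothetical.

```python
train_data = [({"a": 4, "b": 1, "c": 0}, "ham"),
              ({"a": 5, "b": 2, "c": 1}, "ham"),
              ({"a": 0, "b": 3, "c": 4}, "spam"),
              ({"a": 5, "b": 1, "c": 1}, "ham"),
              ({"a": 1, "b": 4, "c": 3}, "spam")]

# One column per feature name, in sorted order.
feature_names = sorted({name for featureset, _ in train_data
                        for name in featureset})


def to_row(featureset):
    # Assumption: a feature missing from a dictionary counts as 0.
    return [featureset.get(name, 0) for name in feature_names]


matrix = [to_row(featureset) for featureset, _ in train_data]
labels = [label for _, label in train_data]

print(feature_names)
for row, label in zip(matrix, labels):
    print(row, label)
```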

Test the Maximum Entropy classifier training algorithms; they should all
generate the same results.

    >>> def print_maxent_test_header():
    ...     print(' '*11+''.join(['      test[%s]  ' % i
    ...                           for i in range(len(test))]))
    ...     print(' '*11+'     p(x)  p(y)'*len(test))
    ...     print('-'*(11+15*len(test)))

    >>> def test_maxent(algorithm):
    ...     print('%11s' % algorithm, end=' ')
    ...     try:
    ...         classifier = nltk.classify.MaxentClassifier.train(
    ...                         train, algorithm, trace=0, max_iter=1000)
    ...     except Exception as e:
    ...         print('Error: %r' % e)
    ...         return
    ...
    ...     for featureset in test:
    ...         pdist = classifier.prob_classify(featureset)
    ...         print('%8.2f%6.2f' % (pdist.prob('x'), pdist.prob('y')), end=' ')
    ...     print()

    >>> print_maxent_test_header(); test_maxent('GIS'); test_maxent('IIS')
                     test[0]        test[1]        test[2]        test[3]
                    p(x)  p(y)     p(x)  p(y)     p(x)  p(y)     p(x)  p(y)
    -----------------------------------------------------------------------
            GIS     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24
            IIS     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24

    >>> test_maxent('MEGAM'); test_maxent('TADM') # doctest: +SKIP
            MEGAM   0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24
            TADM    0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24
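
The claim that all algorithms generate the same results can also be checked
programmatically.  The sketch below copies the GIS and IIS rows from the
table above and compares them at the two-decimal precision that is printed.
The helper names and tolerances are assumptions chosen for this example.

```python
import math

# Probabilities copied from the table above: p(x), p(y) for test[0]..test[3].
rows = {
    "GIS": [0.16, 0.84, 0.46, 0.54, 0.41, 0.59, 0.76, 0.24],
    "IIS": [0.16, 0.84, 0.46, 0.54, 0.41, 0.59, 0.76, 0.24],
}


def rows_agree(first, second, tol=0.005):
    """True if both rows match entry by entry at two-decimal precision."""
    return len(first) == len(second) and all(
        math.isclose(p, q, abs_tol=tol) for p, q in zip(first, second))


def pairs_sum_to_one(row, tol=0.011):
    """True if every (p(x), p(y)) pair sums to 1, allowing for rounding."""
    return all(math.isclose(row[i] + row[i + 1], 1.0, abs_tol=tol)
               for i in range(0, len(row), 2))


print(rows_agree(rows["GIS"], rows["IIS"]))
print(all(pairs_sum_to_one(row) for row in rows.values()))
```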


Regression tests for TypedMaxentFeatureEncoding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    >>> from nltk.classify import maxent
    >>> train = [
    ...     ({'a': 1, 'b': 1, 'c': 1}, 'y'),
    ...     ({'a': 5, 'b': 5, 'c': 5}, 'x'),
    ...     ({'a': 0.9, 'b': 0.9, 'c': 0.9}, 'y'),
    ...     ({'a': 5.5, 'b': 5.4, 'c': 5.3}, 'x'),
    ...     ({'a': 0.8, 'b': 1.2, 'c': 1}, 'y'),
    ...     ({'a': 5.1, 'b': 4.9, 'c': 5.2}, 'x')
    ... ]

    >>> test = [
    ...     {'a': 1, 'b': 0.8, 'c': 1.2},
    ...     {'a': 5.2, 'b': 5.1, 'c': 5}
    ... ]

    >>> encoding = maxent.TypedMaxentFeatureEncoding.train(
    ...     train, count_cutoff=3, alwayson_features=True)

    >>> classifier = maxent.MaxentClassifier.train(
    ...     train, bernoulli=False, encoding=encoding, trace=0)

    >>> classifier.classify_many(test)
    ['y', 'x']