129 lines
4.2 KiB
ReStructuredText
129 lines
4.2 KiB
ReStructuredText
.. _labeled_faces_in_the_wild_dataset:
|
|
|
|
The Labeled Faces in the Wild face recognition dataset
|
|
------------------------------------------------------
|
|
|
|
This dataset is a collection of JPEG pictures of famous people collected
|
|
over the internet, all details are available on the official website:
|
|
|
|
http://vis-www.cs.umass.edu/lfw/
|
|
|
|
Each picture is centered on a single face. The typical task is called
|
|
Face Verification: given a pair of two pictures, a binary classifier
|
|
must predict whether the two images are from the same person.
|
|
|
|
An alternative task, Face Recognition or Face Identification is:
|
|
given the picture of the face of an unknown person, identify the name
|
|
of the person by referring to a gallery of previously seen pictures of
|
|
identified persons.
|
|
|
|
Both Face Verification and Face Recognition are tasks that are typically
|
|
performed on the output of a model trained to perform Face Detection. The
|
|
most popular model for Face Detection is called Viola-Jones and is
|
|
implemented in the OpenCV library. The LFW faces were extracted by this
|
|
face detector from various online websites.
|
|
|
|
**Data Set Characteristics:**
|
|
|
|
================= =======================
|
|
Classes 5749
|
|
Samples total 13233
|
|
Dimensionality 5828
|
|
Features real, between 0 and 255
|
|
================= =======================
|
|
|
|
|details-start|
|
|
**Usage**
|
|
|details-split|
|
|
|
|
``scikit-learn`` provides two loaders that will automatically download,
|
|
cache, parse the metadata files, decode the jpeg and convert the
|
|
interesting slices into memmapped numpy arrays. This dataset size is more
|
|
than 200 MB. The first load typically takes more than a couple of minutes
|
|
to fully decode the relevant part of the JPEG files into numpy arrays. If
|
|
the dataset has been loaded once, the following times the loading times
|
|
less than 200ms by using a memmapped version memoized on the disk in the
|
|
``~/scikit_learn_data/lfw_home/`` folder using ``joblib``.
|
|
|
|
The first loader is used for the Face Identification task: a multi-class
|
|
classification task (hence supervised learning)::
|
|
|
|
>>> from sklearn.datasets import fetch_lfw_people
|
|
>>> lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
|
|
|
|
>>> for name in lfw_people.target_names:
|
|
... print(name)
|
|
...
|
|
Ariel Sharon
|
|
Colin Powell
|
|
Donald Rumsfeld
|
|
George W Bush
|
|
Gerhard Schroeder
|
|
Hugo Chavez
|
|
Tony Blair
|
|
|
|
The default slice is a rectangular shape around the face, removing
|
|
most of the background::
|
|
|
|
>>> lfw_people.data.dtype
|
|
dtype('float32')
|
|
|
|
>>> lfw_people.data.shape
|
|
(1288, 1850)
|
|
|
|
>>> lfw_people.images.shape
|
|
(1288, 50, 37)
|
|
|
|
Each of the ``1140`` faces is assigned to a single person id in the ``target``
|
|
array::
|
|
|
|
>>> lfw_people.target.shape
|
|
(1288,)
|
|
|
|
>>> list(lfw_people.target[:10])
|
|
[5, 6, 3, 1, 0, 1, 3, 4, 3, 0]
|
|
|
|
The second loader is typically used for the face verification task: each sample
|
|
is a pair of two picture belonging or not to the same person::
|
|
|
|
>>> from sklearn.datasets import fetch_lfw_pairs
|
|
>>> lfw_pairs_train = fetch_lfw_pairs(subset='train')
|
|
|
|
>>> list(lfw_pairs_train.target_names)
|
|
['Different persons', 'Same person']
|
|
|
|
>>> lfw_pairs_train.pairs.shape
|
|
(2200, 2, 62, 47)
|
|
|
|
>>> lfw_pairs_train.data.shape
|
|
(2200, 5828)
|
|
|
|
>>> lfw_pairs_train.target.shape
|
|
(2200,)
|
|
|
|
Both for the :func:`sklearn.datasets.fetch_lfw_people` and
|
|
:func:`sklearn.datasets.fetch_lfw_pairs` function it is
|
|
possible to get an additional dimension with the RGB color channels by
|
|
passing ``color=True``, in that case the shape will be
|
|
``(2200, 2, 62, 47, 3)``.
|
|
|
|
The :func:`sklearn.datasets.fetch_lfw_pairs` datasets is subdivided into
|
|
3 subsets: the development ``train`` set, the development ``test`` set and
|
|
an evaluation ``10_folds`` set meant to compute performance metrics using a
|
|
10-folds cross validation scheme.
|
|
|
|
|details-end|
|
|
|
|
.. topic:: References:
|
|
|
|
* `Labeled Faces in the Wild: A Database for Studying Face Recognition
|
|
in Unconstrained Environments.
|
|
<http://vis-www.cs.umass.edu/lfw/lfw.pdf>`_
|
|
Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller.
|
|
University of Massachusetts, Amherst, Technical Report 07-49, October, 2007.
|
|
|
|
|
|
.. topic:: Examples:
|
|
|
|
* :ref:`sphx_glr_auto_examples_applications_plot_face_recognition.py`
|