imported>Katya: Migrated current public revision from wiki.cs.hse.ru

2016-11-05T19:48:54Z

Migrated current public revision from wiki.cs.hse.ru

Новая страница

Ekaterina Chernyak, Dmitry Ilvovsky

== How many words? ==

"The rain in Spain stays mainly in the plain."
9 '''tokens''': The, rain, in, Spain, stays, mainly, in, the, plain
7 (or 8) '''types''': T = the rain, in, Spain, stays, mainly, plain

=== Type and token ===

''Type'' is an element of the vocabulary.

''Token'' is an instance of that type in the text.

N = number of tokens;

V - vocabulary (i.e. all types);

|V| = size of vocabulary (i.e. number of types).

How are N and |V| related?

== Zipf's law ==

=== Zipf's law ([Gelbukh, Sidorov, 2001]) ===

In any large enough text, the frequency ranks (starting from the highest)
of types are inversely proportional to the corresponding frequencies:
f = 1/r

''f'' — frequency of a type;

''r'' — rank of a type (its position in the list of all types in order of their frequency of occurrence).

[[Файл:L2 p1.jpg|мини|Zipf's law: example]]

== Heaps' law ==

Heaps' law ([Gelbukh, Sidorov, 2001])

The number of different types in a text is roughly proportional to an exponent of its size:
[[Файл:L2 p2.jpg|мини|слева]]

''N'' = number of tokens;

|V| = size of vocabulary (i.e. number of types);

''K'', ''b'' — free parameters, K ∈ [10; 100]; b ∈ [0.4; 0.6]

== Why tokenization is difficult? ==

*Easy example: "Good muffins cost $3.88 in New York. Please buy me two of them. Thanks."
** is \." a token?
** is $3.88 a single token?
** is \New York" a single token?
*Real data may contain noise in it: code, markup, URLs, faulty punctuation
* Real data contains misspellings: "an dthen she aksed"
*Period "." does not always mean the end of sentence: m.p.h., PhD.

Nevertheless tokenization is important for all other text processing steps.
There are rule-based and machine learning-based approaches to development of tokenizers.

== Rule-based tokenization ==

For example, define a token as a sequence of upper and lower case letters: A-Za-z. Reqular expression is a nice tool for programming such rules.

===RE in Python===

<code>
In[1]: import re

In[2]: prog = re.compile('[A-Za-z]+')

In[3]: prog.findall("Words, words, words.")

Out[1]: ['Words', 'words', 'words']
</code>

== Sentence segmentation ==

What are the sentence boundaries?
* ?, ! are usually unambiguous
* Period "." is an issue
* Direct speech is also an issue: She said, "What time will you be home?" and I said, "I don't know!". Even worse in Russian!

Let us learn a classifier for sentence segmentation.

===Binary classifier ===

A binary classifier f : X ⇒ 0; 1 takes input data ''X'' (a set of sentences) and decides EndOfSentence (0) or NotEndOfSentence (1).

What can be the features for classification? I am a period, am I EndOfSentence?
* Lots of blanks after me?
* Lots of lower case letters and ? or ! after me?
* Do I belong to abbreviation?
* etc.

We need a lot of hand-markup.

== Natural Language Toolkit (NLTK) ==

Do we need to program this?
No! There is Natural Language Toolkit (NLTK) for everything.

'''NLTK tokenizers'''

<code>
In[1]: from nltk.tokenize import RegexpTokenizer,
wordpunct tokenize

In[2]: s = 'Good muffins cost $3.88 in New York. Please
buy me two of them. Thanks.'

In[3]: tokenizer = RegexpTokenizer('\w+ | \$ [\d \.]+ | S \+')

In[4]: tokenizer.tokenize(s)

In[5]: wordpunct tokenize(s)
</code>

=== Learning to tokenize ===

nltk.tokenize.punkt is a tool for learning to tokenize from your data. It includes pre-trained Punkt tokenizer for English.

==== Punkt tokenizer ====
<code>
In[1]: import nltk.data

In[2]: sent detector = nltk.data.load('tokenizers/punkt/english.pickle')

In[3]: sent detector.tokenize(s)

</code>

== Exercise 1.1 Word counts ==

Input: Alice in Wonderland (alice.txt) or your text

Output 1: number of tokens

Output 2: number of types

Use nltk.FreqDist() to count types. nltk.FreqDist() is a frequency
dictionary: [key, frequency(key)].

== Lemmatization (Normalization) ==

Each word has a base form:
* has, had, have ⇒ have
* cats, cat, cat's ⇒ cat
* Windows ⇒ window or Windows?

=== Lemmatization [Jurafsky, Martin, 1999] ===
Lemmatization (or normalization) is used to reduce in ections or variant forms to base forms. A dictionary with headwords is required.

=== Lemmatization ===

<code>

In[1]: from nltk.stem import WordNetLemmatizer

In[2]: lemmatizer = WordNetLemmatizer()

In[3]: lemmatizer.lemmatize(t)

</code>

== Stemming ==

A word is built with morphems: ''word = stem + affixes''. Sometimes we do not need affixes.

translate, translation, translator ⇒ translat

=== Stemming [Jurafsky, Martin, 1999] ===

Reduce terms to their stems in information retrieval and text classification. Porter's algorithm is the most common English stemmer.

=== Stemming ===

<code>
In[1]: from nltk.stem.porter import PorterStemmer

In[2]: stemmer = PorterStemmer()

In[3]: stemmer.stem(t)
</code>

== Exercise 1.2 Word counts (continued) ==

Input: Alice in Wonderland (alice.txt) or your text

Output 1: 20 most common lemmata

Output 2: 20 most common stems

Use FreqDist() to count lemmata and stems. Use FreqDist().most common() to find most common lemmata and stems.

== Exercise 1.3 Do we need all words? ==

Stopword is a not meaningful word: prepositions, adjunctions, pronouns, articles, etc.

'''Stopwords'''
<code>

In[1]: from nltk.corpus import stopwords

In[2]: print stopwords.words('english')

</code>

'''Exercise 1.3'''

Input: Alice in Wonderland (alice.txt) or your text

Output 1: 20 most common lemmata without stop words

Use not in operator to exclude not stopwords in a cycle.

Lecture 2. Tokenization and word counts - История изменений

imported>Katya: Migrated current public revision from wiki.cs.hse.ru