<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="ru">
	<id>https://www.wikicshse.ru/index.php?action=history&amp;feed=atom&amp;title=Lecture_2._Tokenization_and_word_counts</id>
	<title>Lecture 2. Tokenization and word counts - История изменений</title>
	<link rel="self" type="application/atom+xml" href="https://www.wikicshse.ru/index.php?action=history&amp;feed=atom&amp;title=Lecture_2._Tokenization_and_word_counts"/>
	<link rel="alternate" type="text/html" href="https://www.wikicshse.ru/index.php?title=Lecture_2._Tokenization_and_word_counts&amp;action=history"/>
	<updated>2026-06-06T11:08:35Z</updated>
	<subtitle>История изменений этой страницы в вики</subtitle>
	<generator>MediaWiki 1.45.3</generator>
	<entry>
		<id>https://www.wikicshse.ru/index.php?title=Lecture_2._Tokenization_and_word_counts&amp;diff=429&amp;oldid=prev</id>
		<title>imported&gt;Katya: Migrated current public revision from wiki.cs.hse.ru</title>
		<link rel="alternate" type="text/html" href="https://www.wikicshse.ru/index.php?title=Lecture_2._Tokenization_and_word_counts&amp;diff=429&amp;oldid=prev"/>
		<updated>2016-11-05T19:48:54Z</updated>

		<summary type="html">&lt;p&gt;Migrated current public revision from wiki.cs.hse.ru&lt;/p&gt;
&lt;p&gt;&lt;b&gt;Новая страница&lt;/b&gt;&lt;/p&gt;&lt;div&gt;Ekaterina Chernyak, Dmitry Ilvovsky&lt;br /&gt;
&lt;br /&gt;
== How many words? ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;The rain in Spain stays mainly in the plain.&amp;quot;&lt;br /&gt;
9 &amp;#039;&amp;#039;&amp;#039;tokens&amp;#039;&amp;#039;&amp;#039;: The, rain, in, Spain, stays, mainly, in, the, plain&lt;br /&gt;
7 (or 8) &amp;#039;&amp;#039;&amp;#039;types&amp;#039;&amp;#039;&amp;#039;: T = the rain, in, Spain, stays, mainly, plain&lt;br /&gt;
&lt;br /&gt;
=== Type and token ===&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;Type&amp;#039;&amp;#039; is an element of the vocabulary.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;Token&amp;#039;&amp;#039; is an instance of that type in the text.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
N = number of tokens;&lt;br /&gt;
&lt;br /&gt;
V - vocabulary (i.e. all types);&lt;br /&gt;
&lt;br /&gt;
|V| = size of vocabulary (i.e. number of types).&lt;br /&gt;
&lt;br /&gt;
How are N and |V| related?&lt;br /&gt;
&lt;br /&gt;
== Zipf&amp;#039;s law ==&lt;br /&gt;
&lt;br /&gt;
=== Zipf&amp;#039;s law ([Gelbukh, Sidorov, 2001]) ===&lt;br /&gt;
&lt;br /&gt;
In any large enough text, the frequency ranks (starting from the highest)&lt;br /&gt;
of types are inversely proportional to the corresponding frequencies:&lt;br /&gt;
 f = 1/r&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;f&amp;#039;&amp;#039; — frequency of a type;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;r&amp;#039;&amp;#039; — rank of a type (its position in the list of all types in order of their frequency of occurrence).&lt;br /&gt;
&lt;br /&gt;
[[Файл:L2 p1.jpg|мини|Zipf&amp;#039;s law: example]]&lt;br /&gt;
&lt;br /&gt;
== Heaps&amp;#039; law ==&lt;br /&gt;
&lt;br /&gt;
Heaps&amp;#039; law ([Gelbukh, Sidorov, 2001])&lt;br /&gt;
&lt;br /&gt;
The number of different types in a text is roughly proportional to an exponent of its size:&lt;br /&gt;
[[Файл:L2 p2.jpg|мини|слева]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;N&amp;#039;&amp;#039; = number of tokens;&lt;br /&gt;
&lt;br /&gt;
|V| = size of vocabulary (i.e. number of types);&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;K&amp;#039;&amp;#039;, &amp;#039;&amp;#039;b&amp;#039;&amp;#039; — free parameters,  K &amp;amp;isin; [10; 100]; b &amp;amp;isin; [0.4; 0.6]&lt;br /&gt;
&lt;br /&gt;
== Why tokenization is difficult? ==&lt;br /&gt;
&lt;br /&gt;
*Easy example: &amp;quot;Good muffins cost $3.88 in New York. Please buy me two of them. Thanks.&amp;quot;&lt;br /&gt;
** is \.&amp;quot; a token?&lt;br /&gt;
** is $3.88 a single token?&lt;br /&gt;
** is \New York&amp;quot; a single token?&lt;br /&gt;
*Real data may contain noise in it: code, markup, URLs, faulty punctuation&lt;br /&gt;
* Real data contains misspellings: &amp;quot;an dthen she aksed&amp;quot;&lt;br /&gt;
*Period &amp;quot;.&amp;quot; does not always mean the end of sentence: m.p.h., PhD.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Nevertheless tokenization is important for all other text processing steps.&lt;br /&gt;
There are rule-based and machine learning-based approaches to development of tokenizers.&lt;br /&gt;
&lt;br /&gt;
== Rule-based tokenization ==&lt;br /&gt;
&lt;br /&gt;
For example, define a token as a sequence of upper and lower case letters: A-Za-z. Reqular expression is a nice tool for programming such rules.&lt;br /&gt;
&lt;br /&gt;
===RE in Python===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
In[1]: import re&lt;br /&gt;
&lt;br /&gt;
In[2]: prog = re.compile(&amp;#039;[A-Za-z]+&amp;#039;)&lt;br /&gt;
&lt;br /&gt;
In[3]: prog.findall(&amp;quot;Words, words, words.&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
Out[1]: [&amp;#039;Words&amp;#039;, &amp;#039;words&amp;#039;, &amp;#039;words&amp;#039;]&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Sentence segmentation ==&lt;br /&gt;
&lt;br /&gt;
What are the sentence boundaries?&lt;br /&gt;
* ?, ! are usually unambiguous&lt;br /&gt;
* Period &amp;quot;.&amp;quot; is an issue&lt;br /&gt;
* Direct speech is also an issue: She said, &amp;quot;What time will you be home?&amp;quot; and I said, &amp;quot;I don&amp;#039;t know!&amp;quot;. Even worse in Russian!&lt;br /&gt;
&lt;br /&gt;
Let us learn a classifier for sentence segmentation.&lt;br /&gt;
&lt;br /&gt;
===Binary classifier ===&lt;br /&gt;
&lt;br /&gt;
A binary classifier f : X &amp;amp;rArr; 0; 1  takes input data &amp;#039;&amp;#039;X&amp;#039;&amp;#039; (a set of sentences) and decides EndOfSentence (0) or NotEndOfSentence (1).&lt;br /&gt;
&lt;br /&gt;
What can be the features for classification? I am a period, am I EndOfSentence?&lt;br /&gt;
* Lots of blanks after me?&lt;br /&gt;
* Lots of lower case letters and ? or ! after me?&lt;br /&gt;
* Do I belong to abbreviation?&lt;br /&gt;
* etc.&lt;br /&gt;
&lt;br /&gt;
We need a lot of hand-markup.&lt;br /&gt;
&lt;br /&gt;
== Natural Language Toolkit (NLTK) ==&lt;br /&gt;
&lt;br /&gt;
Do we need to program this?&lt;br /&gt;
No! There is Natural Language Toolkit (NLTK) for everything.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;NLTK tokenizers&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
In[1]: from nltk.tokenize import RegexpTokenizer,&lt;br /&gt;
wordpunct tokenize&lt;br /&gt;
&lt;br /&gt;
In[2]: s = &amp;#039;Good muffins cost $3.88 in New York. Please&lt;br /&gt;
buy me two of them. Thanks.&amp;#039;&lt;br /&gt;
&lt;br /&gt;
In[3]: tokenizer = RegexpTokenizer(&amp;#039;\w+ | \$ [\d \.]+ | S \+&amp;#039;)&lt;br /&gt;
&lt;br /&gt;
In[4]: tokenizer.tokenize(s)&lt;br /&gt;
&lt;br /&gt;
In[5]: wordpunct tokenize(s)&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Learning to tokenize ===&lt;br /&gt;
&lt;br /&gt;
nltk.tokenize.punkt is a tool for learning to tokenize from your data. It includes pre-trained Punkt tokenizer for English.&lt;br /&gt;
&lt;br /&gt;
==== Punkt tokenizer ====&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
In[1]: import nltk.data&lt;br /&gt;
&lt;br /&gt;
In[2]: sent detector = nltk.data.load(&amp;#039;tokenizers/punkt/english.pickle&amp;#039;)&lt;br /&gt;
&lt;br /&gt;
In[3]: sent detector.tokenize(s)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Exercise 1.1 Word counts ==&lt;br /&gt;
&lt;br /&gt;
Input: Alice in Wonderland (alice.txt) or your text&lt;br /&gt;
&lt;br /&gt;
Output 1: number of tokens&lt;br /&gt;
&lt;br /&gt;
Output 2: number of types&lt;br /&gt;
&lt;br /&gt;
Use nltk.FreqDist() to count types. nltk.FreqDist() is a frequency&lt;br /&gt;
dictionary: [key, frequency(key)].&lt;br /&gt;
&lt;br /&gt;
== Lemmatization (Normalization) ==&lt;br /&gt;
&lt;br /&gt;
Each word has a base form:&lt;br /&gt;
* has, had, have &amp;amp;rArr; have&lt;br /&gt;
* cats, cat, cat&amp;#039;s &amp;amp;rArr; cat&lt;br /&gt;
* Windows &amp;amp;rArr; window or Windows?&lt;br /&gt;
&lt;br /&gt;
=== Lemmatization [Jurafsky, Martin, 1999] ===&lt;br /&gt;
Lemmatization (or normalization) is used to reduce in ections or variant forms to base forms. A dictionary with headwords is required.&lt;br /&gt;
&lt;br /&gt;
=== Lemmatization ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In[1]: from nltk.stem import WordNetLemmatizer&lt;br /&gt;
&lt;br /&gt;
In[2]: lemmatizer = WordNetLemmatizer()&lt;br /&gt;
&lt;br /&gt;
In[3]: lemmatizer.lemmatize(t)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Stemming ==&lt;br /&gt;
&lt;br /&gt;
A word is built with morphems: &amp;#039;&amp;#039;word = stem + affixes&amp;#039;&amp;#039;. Sometimes we do not need affixes.&lt;br /&gt;
&lt;br /&gt;
translate, translation, translator &amp;amp;rArr; translat&lt;br /&gt;
&lt;br /&gt;
=== Stemming [Jurafsky, Martin, 1999] ===&lt;br /&gt;
&lt;br /&gt;
Reduce terms to their stems in information retrieval and text classification. Porter&amp;#039;s algorithm is the most common English stemmer.&lt;br /&gt;
&lt;br /&gt;
=== Stemming ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
In[1]: from nltk.stem.porter import PorterStemmer&lt;br /&gt;
&lt;br /&gt;
In[2]: stemmer = PorterStemmer()&lt;br /&gt;
&lt;br /&gt;
In[3]: stemmer.stem(t)&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Exercise 1.2 Word counts (continued) ==&lt;br /&gt;
&lt;br /&gt;
Input: Alice in Wonderland (alice.txt) or your text &lt;br /&gt;
&lt;br /&gt;
Output 1: 20 most common lemmata&lt;br /&gt;
&lt;br /&gt;
Output 2: 20 most common stems&lt;br /&gt;
&lt;br /&gt;
Use FreqDist() to count lemmata and stems. Use FreqDist().most common() to find most common lemmata and stems.&lt;br /&gt;
&lt;br /&gt;
== Exercise 1.3 Do we need all words? ==&lt;br /&gt;
&lt;br /&gt;
Stopword is a not meaningful word: prepositions, adjunctions, pronouns, articles, etc.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Stopwords&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In[1]: from nltk.corpus import stopwords&lt;br /&gt;
&lt;br /&gt;
In[2]: print stopwords.words(&amp;#039;english&amp;#039;)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Exercise 1.3&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Input: Alice in Wonderland (alice.txt) or your text&lt;br /&gt;
&lt;br /&gt;
Output 1: 20 most common lemmata without stop words&lt;br /&gt;
&lt;br /&gt;
Use not in operator to exclude not stopwords in a cycle.&lt;/div&gt;</summary>
		<author><name>imported&gt;Katya</name></author>
	</entry>
</feed>