Introduction to Natural Language Processing; Fall, 2010

English 3525 (also listed as CISC 2830 and LING 3023); section codes 3138, 3429, and 3023
Meets on Tuesdays and Thursdays, from 9:30 to 10:45 in 3404 Boylan Hall; 3 hours, 3 credits
Prerequisites: C.I.S. 1.5; prerequisites or co-requisites C.I.S. 11 and Linguistics 1.

Instructor: Rennie Gonsalves; Office: 1420 Ingersoll Hall; Tel.: 718-951-5928
e-mail: renniegons@yahoo.com
Office Hours: Tuesdays and Thursdays, 12:30-1:30, in 1420 N

Course Description:
This class will introduce you to some of the basic elements of natural language processing (NLP). Using Natural Language Processing with Python by Bird, Klein, and Loper, the course will focus mainly on processing natural language texts, such as Moby Dick or Jane Eyre, an article in the New York Times or the Wall Street Journal, or text from a blog, a tweet, or a website. We will also look more briefly at processing spoken language, and for this we will use a few chapters from Peter Ladefoged's Elements of Acoustic Phonetics, Second Edition, and Introducing Speech and Language Processing by John Coleman. Class exercises, projects, and exams will emphasize programming in Python, though the Coleman text will take us a bit into C and Prolog. Our weekly assignments will involve writing short NLP programs, mostly in Python.
We will begin with an overview of Python and the Natural Language Toolkit (NLTK), which provides Python-friendly software for natural language analysis and a variety of literary and other texts with which to practice NLP programming. This will be essentially hands-on learning: we will look at how small NLP programs work and create some of them ourselves. The topics that we will cover include importing and analyzing text corpora and raw texts; writing structured Python programs; categorizing and tagging words; regular expressions; n-grams; and analyzing sentence structure. Finally, using the Coleman text, we will turn to speech synthesis and speech recognition. Topics covered here will include speech sound sampling and quantization, representing speech sounds using arrays of numbers, digital filters and resonators, finite state machines and transducers for phonological and syntactic analysis, architectures for speech recognition, and dynamic time warping. We will end the semester with presentations and discussions of student NLP projects.


Course Objectives:
1. Students will understand the structure of a grammar of a natural language, including its phonetics, phonology, syntax and semantics.
2. Students will be able to describe the major aspects of the acoustics of speech sounds and explain how these can be implemented.
3. Students will be able to use Python and the Natural Language Toolkit to access and analyze a variety of text corpora.
4. Students will be able to explain the functioning of Python implementations of small NLP projects.
5. Students will be able to write and explain the functioning of a Python implementation of a context-free grammar for a fragment of English.
6. Students will be able to work effectively as part of a team on a small NLP project.
7. Students will be able to write a report on an NLP system, clearly describing its design and operation.
8. Students will be able to deliver such a report orally to the rest of the class.


Required Texts (Available at the Brooklyn College Bookstore; 718 434-0333):

Bird, Steven, Ewan Klein & Edward Loper. Natural Language Processing with Python. O'Reilly Media Inc. 2009.
Coleman, John. Introducing Speech and Language Processing. Cambridge University Press, 2005.
Ladefoged, Peter. Elements of Acoustic Phonetics, Second Edition. The University of Chicago Press. 1996.



COURSE OUTLINE

Section 1: WEEKS 1-4: Language, Python, and the NLTK


The purpose of the course; some basic ideas in linguistics; overview of the components of a grammar: phonetics, phonology, morphology, syntax and semantics; introduction to using Python and the Natural Language Toolkit (NLTK); accessing and analyzing text corpora using Python and the NLTK; writing structured Python programs; n-grams; categorizing and tagging words.

Readings - Bird, Chapters 1 to 5.
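
As a rough illustration of the n-gram topic in this section, the following is a minimal sketch in plain Python, not taken from the Bird text. The sample sentence and the helper names tokenize and ngrams are hypothetical; NLTK offers comparable helpers (for example nltk.bigrams and nltk.FreqDist) that we will meet in the readings.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample sentence, used only for illustration.
sample = "The whale saw the ship and the ship saw the whale."
tokens = tokenize(sample)

# Count how often each pair of adjacent words occurs.
bigram_counts = Counter(ngrams(tokens, 2))

for bigram, count in bigram_counts.most_common(3):
    print(bigram, count)
```

Changing the second argument of ngrams from 2 to 3 gives trigrams in the same way.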

********************

Section 2: WEEKS 5-7: Words, Sentences and Texts

Classifying texts; Naïve Bayes classifiers and decision trees; modeling linguistic patterns; extracting information from texts; chunking; analyzing sentence structure; parsing with context-free grammars; dependencies and dependency grammars.

Readings - Bird, Chapters 6 to 8.
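
To give a sense of what parsing with a context-free grammar involves, here is a minimal sketch of a CYK-style recognizer in plain Python. The toy grammar, the lexicon, and the function name recognize are all hypothetical and cover only a tiny fragment of English; the course readings use NLTK's own grammar and parser classes instead.

```python
from itertools import product

# Hypothetical toy grammar in Chomsky normal form:
# each rule rewrites one category as exactly two categories.
BINARY_RULES = {
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}

# Hypothetical lexicon mapping words to their categories.
LEXICON = {
    "the": {"Det"}, "a": {"Det"},
    "dog": {"N"}, "cat": {"N"},
    "saw": {"V"}, "chased": {"V"},
}

def recognize(words, start="S"):
    """Return True if the grammar derives `start` over the whole word list."""
    n = len(words)
    if n == 0:
        return False
    # chart[i][j] holds the categories that span words[i:j].
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = set(LEXICON.get(word, ()))
    # Build longer spans from pairs of adjacent shorter spans.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for left, right in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= BINARY_RULES.get((left, right), set())
    return start in chart[0][n]

print(recognize("the dog chased a cat".split()))
print(recognize("dog the chased cat a".split()))
```

This sketch only answers yes or no; a full parser would also record how each span was built so that the tree can be recovered.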

*********************

Section 3: WEEKS 8-10: Introduction to Acoustic Phonetics and Implementing a Cosine Wave

Introduction to acoustic phonetics; the basic features of sound waves: loudness, pitch, and quality; sound spectra. The production of speech; the formants and harmonics of vowel sounds; the sampling theorem; introduction to the mathematics of sound waves; sines, cosines and logarithms; using the Coleman software; introduction to C via the cosine wave program; using an array of numbers in C to implement a cosine wave; data types (char, int, short, long, float, double) in C.

Readings - Ladefoged, Chapters 1 to 4.
- Coleman, Chapters 1 and 2.
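
The idea of representing a cosine wave as an array of numbers can be sketched as follows. The course works through this in C using the Coleman software; the version below is a hypothetical Python equivalent for consistency with the rest of the course, and the function names, the 1000 Hz frequency, and the 8000 Hz sampling rate are assumptions chosen only for illustration.

```python
import math

def cosine_samples(frequency_hz, sample_rate_hz, n_samples, amplitude=1.0):
    """Return n_samples of amplitude * cos(2*pi*f*t), taken at t = k / sample_rate."""
    return [amplitude * math.cos(2 * math.pi * frequency_hz * k / sample_rate_hz)
            for k in range(n_samples)]

def quantize_16bit(samples):
    """Map samples in [-1.0, 1.0] to signed 16-bit integers (the range of a typical C short)."""
    return [max(-32768, min(32767, round(s * 32767))) for s in samples]

# Assumed values: a 1000 Hz tone sampled at 8000 Hz gives 8 samples per cycle.
# The sampling rate is more than twice the tone's frequency, as the sampling theorem requires.
wave = cosine_samples(frequency_hz=1000, sample_rate_hz=8000, n_samples=8)
pcm = quantize_16bit(wave)

print(pcm)
```

The list wave holds the sampled values as floating-point numbers, and pcm holds the same wave after quantization to whole numbers.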

********************


Section 4: WEEKS 11-13: Speech Synthesis and Speech Recognition

Overview of root mean square amplitude, running means of 4, high pass and band-pass filters, and the Klatt formant synthesizer. Finite state automata and grammar; deterministic and non-deterministic FSAs; implementation of a non-deterministic FSA in Prolog to recognize some English monosyllabic phoneme strings; an FSA for Syntax; a student-designed implementation in Prolog? Outlines of an architecture for speech recognition; the knowledge-based and pattern-matching approaches to speech recognition, dynamic time warping, vector quantization.

Readings - Coleman, Chapters 3, 5, and 6.
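
Dynamic time warping, one of the pattern-matching techniques listed above, can be sketched in a few lines of Python. This is a minimal textbook-style version under simple assumptions (one number per frame, absolute difference as the local cost, no constraint on the warping path); it is not the implementation from the Coleman text, and the function name dtw_distance is hypothetical.

```python
def dtw_distance(a, b):
    """Return the dynamic time warping cost between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three neighbouring alignments.
            cost[i][j] = local + min(cost[i - 1][j],      # a advances, b repeats
                                     cost[i][j - 1],      # b advances, a repeats
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# The second sequence is a "slower" version of the first.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
print(dtw_distance([0, 0], [1, 1]))
```

Because a frame in one sequence may be matched to several frames in the other, two utterances of the same word spoken at different speeds can still receive a low cost.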

***********************

Section 5: WEEK 14: Presentation of Student Projects

Participation:
Students must attend regularly, arrive on time, and be prepared to participate, having done the assigned work. Participation will count for 10% of the overall grade for the course.

Assignments:

Papers:
Students will write one term paper of 5-7 pages. This paper will be centered on your implementation in Python, C, or Prolog of some aspect of natural language processing. It will include the code and a detailed description of both the linguistic and the programming aspects of the project.

Exams:

At the end of each of the first two sections of the course there will be a take-home test based on the material in that section. At the end of the semester there will also be a final exam based on sections three and four of our syllabus.

Grading:

Grades will be based on the following percentages:
Class Participation: 10%
Take-Home Tests: 30%
Homework: 10%
Term Paper: 30%
Final Exam: 20%


Evaluation criteria for class participation:

Your class participation will be judged on the basis of your questions, your respect for other class members' and my points of view (as shown in the way you respond to others' ideas), and your attentiveness to the discussion (people who don't like to speak frequently will not be penalized, but you should make an effort to participate). I also expect that your participation will reflect your having done the reading and other homework for each class.

Evaluation criteria for written work:
From a list by Lewis Hyde, edited by Sue Lonoff, with thanks to Richard Marius's writing handbook.
The Unsatisfactory Paper.
The D or F paper either has no thesis or else it has one that is strikingly vague, broad, or uninteresting. There is little indication that the writer understands the material being presented. The paragraphs do not hold together; ideas do not develop from sentence to sentence. This paper usually repeats the same thoughts again and again, perhaps in slightly different language but often in the same words. The D or F paper is filled with mechanical faults, errors in grammar, and errors in spelling.
The C Paper.
The C paper has a thesis, but it is vague and broad, or else it is uninteresting or obvious. It does not advance an argument that anyone might care to debate. "Henry James wrote some interesting novels." "Modern cities are interesting places."
The thesis in the C paper often hangs on some personal opinion. If the writer is a recognized authority, such an expression of personal taste may be noteworthy, but writers gain authority not merely by expressing their tastes but by justifying them. Personal opinion is often the engine that drives an argument, but opinion by itself is never sufficient. It must be defended.
The C paper rarely uses evidence well; sometimes it does not use evidence at all. Even if it has a clear and interesting thesis, a paper with insufficient supporting evidence is a C paper.
The C paper often has mechanical faults, errors in grammar and spelling, but please note: a paper without such flaws may still be a C paper.
The B Paper.
The reader of a B paper knows exactly what the author wants to say. It is well organized, it presents a worthwhile and interesting idea, and the idea is supported by sound evidence presented in a neat and orderly way. Some of the sentences may not be elegant, but they are clear, and in them thought follows naturally on thought. The paragraphs may be unwieldy now and then, but they are organized around one main idea. The reader does not have to read a paragraph two or three times to get the thought that the writer is trying to convey.
The B paper is always mechanically correct. The spelling is good, and the punctuation is accurate. Above all, the paper makes sense throughout. It has a thesis that is limited and worth arguing. It does not contain unexpected digressions, and it ends by keeping the promise to argue and inform that the writer makes in the beginning.
The A Paper.
The A paper has all the good qualities of the B paper, but in addition it is lively, well paced, interesting, even exciting. The paper has style. Everything in it seems to fit the thesis exactly. It may have a proofreading error or two, or even a misspelled word, but the reader feels that these errors are the consequence of the normal accidents all good writers encounter. Reading the paper, we can feel a mind at work. We are convinced that the writer cares for his or her ideas, and about the language that carries them.
Copyright © 2002, 2003 by the President and Fellows of Harvard College. Permission is granted to non-profit educational institutions to reproduce this document for internal use provided that the Bok Center's authorship and copyright are acknowledged.


Bibliography:

Allen, James. Natural Language Understanding. The Benjamin/Cummings Publishing Company Inc. 1994.

Bird, Steven, Ewan Klein & Edward Loper. Natural Language Processing with Python. O'Reilly Media Inc. 2009.

Carnie, Andrew. Syntax; A Generative Introduction; Second Edition. Blackwell, 2007.

Chomsky, Noam. Syntactic Structures. Mouton, 1971.

Clark, John, Collin Yallop, and Janet Fletcher. An Introduction to Phonetics and Phonology; Third Edition. Wiley-Blackwell, 2007.

Clocksin, William F. and Christopher S. Mellish. Programming In Prolog; Using the ISO Standard; Fifth Edition. Springer, 2003.

Coleman, John. Introducing Speech and Language Processing. Cambridge University Press, 2005.

Hausser, Roland. Foundations of Computational Linguistics; Human-Computer Communication in Natural Language. Springer, 2001.

Jackendoff, Ray. Foundations of Language; Brain, Meaning, Grammar, Evolution. Oxford University Press, 2003.

Johnson, Keith. Acoustic and Auditory Phonetics. Blackwell Publishing, 2006.

Jurafsky, Daniel, and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, New Jersey, 2008.

Kernighan, Brian W. and Dennis M. Ritchie. The C Programming Language; Second Edition. Prentice Hall, 1988.

Ladefoged, Peter. Elements of Acoustic Phonetics, Second Edition. The University of Chicago Press. 1996.

Lyons, Richard G. Understanding Digital Signal Processing; Second Edition. Prentice Hall, 2004.

O'Grady, William, et al. Contemporary Linguistics; An Introduction; Sixth Edition. Bedford/Saint Martin's Press, 2009.

Pereira, Fernando C. N. and Stuart M. Shieber. Prolog and Natural Language Analysis. Center for the Study of Language and Information, 1987.

Tarski, Alfred. Introduction to Logic and to the Methodology of Deductive Sciences. Dover Publications, New York, 1995.