Introduction to NATURAL LANGUAGE PROCESSING (NLP)
Dr. Resmi N.G.
Assistant Professor, CSE, MITS
Day 1 - Session 2
Let machines talk!!! Let machines read!!! Let machines listen!!! Let machines feel!!!
Image sources: towardsdatascience.com, medium.com, stock.adobe.com
Can we survive without NLP???
Natural Language
Language that has developed in the usual way as a method of communicating between people, rather than language that has been created, for example for computers.
7117 languages are spoken today!!!
Source: https://dictionary.cambridge.org/dictionary/english/natural-language, https://www.ethnologue.com/guides/how-many-languages
Distribution of Languages on Internet Websites
https://commons.wikimedia.org/wiki/File:2014_Distribution_of_Languages_on_Internet_Websites.jpg
• Humans have been writing things down for thousands of years, and it would be really helpful if a computer could read and understand all that data.
Panini’s grammar of Sanskrit was written over two thousand years ago and is still referenced today in teaching Sanskrit.
(Source: https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf)
Digitization???
All of us have been part of Google’s digitization process!!!
More than 13 million articles from The New York Times, dating from 1851 to the present day, and many books that were too illegible to be scanned by computers have been digitized as well as translated into different languages!!!
Source: https://blog.goodaudience.com/how-we-all-helped-unknowingly-google-to-digitize-books-acb45bc65084
Unstructured data
Source: https://www.ultimateedgecommunications.com.au/blog/what-happens-in-a-single-internet-minute/
What is NLP?
• Natural Language Processing: the sub-field of AI focused on enabling computers to understand and process human languages.
• Aims to improve human-computer interaction.
• Involves computational processing of natural languages.
Natural Language Processing
AI: Artificial Intelligence
ML: Machine Learning
DL: Deep Learning
Source: https://datascience.foundation/sciencewhitepaper/natural-language-processing-nlp-simplified-a-step-by-step-guide
NLP combines two fields
• Linguistics
  • Scientific study of language.
  • Involves analysis of language form, language meaning, and language in context, as well as analysis of the social, cultural, historical, and political factors that influence language.
• Computer Science
Source: commons.wikimedia.org
Components of NLP
https://data-flair.training/blogs/nlp-tutorial-natural-language-processing/
Steps in NLP
Source: https://datascience.foundation/sciencewhitepaper/natural-language-processing-nlp-simplified-a-step-by-step-guide
NLP Pipeline
https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e
Building an NLP Pipeline Step-by-Step
• Step 1: Sentence Segmentation
  • Text: Hi, I am Sophia. I am a humanoid. I can recognize people and converse with them.
  • S1: Hi, I am Sophia.
  • S2: I am a humanoid.
  • S3: I can recognize people and converse with them.
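Step 1 can be sketched in a few lines of Python. This is only a minimal illustration, assuming a naive rule that a sentence ends at '.', '!' or '?' followed by whitespace. The function name split_sentences is hypothetical, and real segmenters also have to cope with abbreviations such as "Dr." and with decimal numbers, which this rule would split incorrectly.

```python
import re

def split_sentences(text):
    # Naive rule (an assumption): split at whitespace that follows '.', '!' or '?'.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

text = "Hi, I am Sophia. I am a humanoid. I can recognize people and converse with them."
sentences = split_sentences(text)
for number, sentence in enumerate(sentences, start=1):
    print(f"S{number}: {sentence}")
```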
• Step 2: Word Tokenization
  • Word tokenization refers to breaking a sentence into separate words or tokens.
  • Text: The boy’s name was Santiago.
  • Tokenized: ‘The’, ‘boy’s’, ‘name’, ‘was’, ‘Santiago’.
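A rough sketch of Step 2, assuming a token is a run of word characters that may contain an internal apostrophe (straight or curly). The function name tokenize is hypothetical. Punctuation is discarded here to match the slide's output, whereas many real tokenizers keep it as separate tokens.

```python
import re

def tokenize(sentence):
    # Word characters, optionally joined by an apostrophe (straight or curly),
    # so that "boy’s" stays a single token.
    return re.findall(r"\w+(?:['’]\w+)*", sentence)

tokens = tokenize("The boy’s name was Santiago.")
print(tokens)
```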
• Step 3: Part-of-Speech (POS) Tagging
  • Identify each token's part of speech and tag it as a noun, a verb, an adjective, and so on.
  • Text: The quick brown fox jumps over the lazy dog.
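To make the idea of tagging concrete, here is a toy lookup tagger. The lexicon is hand-written for this one sentence and the coarse tag names (DET, ADJ, NOUN, VERB, ADP) are an assumption. Practical taggers are trained statistical models that use context, which a lookup table cannot do.

```python
import re

# Hypothetical hand-written lexicon; the tag names are an assumed coarse tagset.
LEXICON = {
    "the": "DET", "quick": "ADJ", "brown": "ADJ", "fox": "NOUN",
    "jumps": "VERB", "over": "ADP", "lazy": "ADJ", "dog": "NOUN",
}

def tag(sentence):
    tokens = re.findall(r"\w+", sentence)
    # Words missing from the lexicon get "X" because a lookup tagger cannot guess.
    return [(token, LEXICON.get(token.lower(), "X")) for token in tokens]

tagged = tag("The quick brown fox jumps over the lazy dog.")
for token, pos in tagged:
    print(token, pos)
```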
Source: https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf
• Step 4: Text Lemmatization
  • Figuring out the most basic form or lemma of each word in the sentence.
  • Text: The dog is chasing a cat.
  • Lemmatized text: The dog be chase a cat.
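The lemmatization example can be reproduced with a small lookup table. This is only a sketch: the LEMMAS table is hypothetical and covers just the words needed here, while real lemmatizers combine large dictionaries with part-of-speech information.

```python
import re

# Hypothetical lookup table covering only the example sentence.
LEMMAS = {"is": "be", "are": "be", "was": "be", "chasing": "chase"}

def lemmatize(sentence):
    tokens = re.findall(r"\w+", sentence)
    # Words without an entry are assumed to already be in their base form.
    return [LEMMAS.get(token.lower(), token) for token in tokens]

lemmas = lemmatize("The dog is chasing a cat.")
print(" ".join(lemmas))
```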
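• Step 5: Identifying Stop Words
  • Stop words are words that appear very frequently, such as “and”, “the”, and “a”. They are often filtered out before further analysis because they carry little content.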
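Filtering stop words is a simple set lookup. The sketch below assumes a stop list containing only the three words named on the slide. Real stop lists are longer and are usually chosen per application.

```python
import re

# Stop list limited to the three words mentioned above (an assumption).
STOP_WORDS = {"and", "the", "a"}

def remove_stop_words(sentence):
    tokens = re.findall(r"\w+", sentence)
    return [token for token in tokens if token.lower() not in STOP_WORDS]

content_words = remove_stop_words("The dog is chasing a cat.")
print(content_words)
```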
• Step 6: Dependency Parsing
  • Figure out how all the words in our sentence relate to each other. This is called dependency parsing.
  • The goal is to build a tree that assigns a single parent word to each word in the sentence. The root of the tree will be the main verb in the sentence.
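The tree described above can be stored as a list of parent indices. The sketch below does not parse anything: the parse of "The dog is chasing a cat" is hand-annotated (an assumption, not parser output), and the helper names root and children are hypothetical. It only shows the data structure, with one parent per word and the main verb as root.

```python
# Hand-annotated parse (an assumption): HEADS[i] is the index of the parent
# of TOKENS[i], and -1 marks the root of the tree.
TOKENS = ["The", "dog", "is", "chasing", "a", "cat"]
HEADS = [1, 3, 3, -1, 5, 3]

def root(tokens, heads):
    roots = [tokens[i] for i, head in enumerate(heads) if head == -1]
    assert len(roots) == 1, "a dependency tree has exactly one root"
    return roots[0]

def children(tokens, heads, parent):
    parent_index = tokens.index(parent)
    return [tokens[i] for i, head in enumerate(heads) if head == parent_index]

main_verb = root(TOKENS, HEADS)
dependents = children(TOKENS, HEADS, main_verb)
print(main_verb, dependents)
```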
• Finding Noun Phrases
  • Using the information from the dependency parse tree, words that are all talking about the same thing can be grouped together.
Source: medium.com
• Syntax parsing rules
Source: https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf
Source: https://www.nltk.org/book/ch08.html
• Step 7: Named Entity Recognition (NER)
  • The goal of Named Entity Recognition, or NER, is to detect and label the nouns with the real-world concepts that they represent.
  • Typical entity types:
    • Names of persons
    • Company names
    • Geographic locations (both physical and political)
    • Product names
    • Dates and times
    • Amounts of money
    • Names of events
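A toy illustration of NER, assuming a tiny gazetteer (name list) for persons and locations plus regular expressions for money and ISO-style dates. The example sentence, the gazetteer entries, and the function name find_entities are all made up for illustration. Production NER systems are trained models that use context rather than fixed lists.

```python
import re

# Hypothetical gazetteer and patterns, chosen only for this example.
GAZETTEER = {"Santiago": "PERSON", "Madrid": "LOCATION"}
PATTERNS = [
    ("MONEY", re.compile(r"\$\d+(?:\.\d+)?")),
    ("DATE", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
]

def find_entities(text):
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.start(), match.group(), label))
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    # Sort by position so entities come out in reading order.
    return [(span, label) for _, span, label in sorted(found)]

entities = find_entities("Santiago paid $25 in Madrid on 2020-01-15.")
print(entities)
```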
Source: https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf
Source: https://towardsdatascience.com/extend-named-entity-recogniser-ner-to-label-new-entities-with-spacy-339ee5979044
• Step 8: Coreference Resolution
  • Task of finding all expressions that refer to the same entity in a text.
  • It is an important step for a lot of higher-level NLP tasks that involve natural language understanding, such as document summarization, question answering, and information extraction.
Source: https://nlp.stanford.edu
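To give a feel for the task, here is a deliberately simple heuristic: link each pronoun to the most recent earlier mention that agrees in grammatical number. The word lists, the example text, and the function name resolve are assumptions made for illustration. Real coreference systems use far richer features and learned models, and this heuristic fails on many ordinary sentences.

```python
import re

# Hypothetical word lists; grammatical number is the only agreement feature used.
MENTIONS = {"sophia": "singular", "people": "plural"}
PRONOUNS = {"she": "singular", "them": "plural"}

def resolve(text):
    last_seen = {}  # most recent mention for each grammatical number
    links = []
    for token in re.findall(r"\w+", text):
        key = token.lower()
        if key in MENTIONS:
            last_seen[MENTIONS[key]] = token
        elif key in PRONOUNS and PRONOUNS[key] in last_seen:
            links.append((token, last_seen[PRONOUNS[key]]))
    return links

links = resolve("Sophia is a humanoid. She can recognize people and converse with them.")
print(links)
```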
Challenges in NLP
NLP is HARD!!!
• Goal: Deep understanding
• Reality: Shallow Matching
Source: http://www.cs.cmu.edu/afs/cs/user/tbergkir/www/11711fa17/FA17%2011-711%20lecture%201%20--%20introduction.pdf
Text is SUPERFICIAL
AMBIGUITY at all the levels
Phonological Ambiguity
• “I scream” or “ice cream”
• Too, two, or to
Morphological Ambiguity
• watchdogs = watch + dogs or watchdog + s
• unionized = un + ionized or union + ized
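Morphological ambiguity can be shown by enumerating every way to split a word into known morphemes. The sketch below assumes a tiny hand-picked morpheme set that contains only the pieces from the two examples. The function name segmentations is hypothetical.

```python
def segmentations(word, morphemes):
    # Return every way to write `word` as a sequence of known morphemes.
    if not word:
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        prefix, rest = word[:i], word[i:]
        if prefix in morphemes:
            for tail in segmentations(rest, morphemes):
                results.append([prefix] + tail)
    return results

# Hypothetical morpheme inventory covering only the two example words.
MORPHEMES = {"un", "ionized", "union", "ized", "watch", "dogs", "watchdog", "s"}

unionized = segmentations("unionized", MORPHEMES)
watchdogs = segmentations("watchdogs", MORPHEMES)
print(unionized)
print(watchdogs)
```

Both words come back with two analyses, and nothing inside the word itself says which one is intended. That choice needs context.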
Word Sense Ambiguity
Source: https://www.thoughtco.com/syntactic-ambiguity-grammar-1692179
Syntactic Ambiguity
Source: https://languagelog.ldc.upenn.edu/nll/?p=17711
Syntactic Ambiguity (continued)
Punctuation Ambiguity
Humans apply commonsense!!! Computers lack commonsense knowledge!!
Pronoun Resolution
Source: https://www.printwand.com/blog/8-catastrophic-examples-of-word-choice-mistakes
Source: https://techvidvan.com/tutorials/natural-language-processing-nlp/
Other Applications
• Paraphrase detection
• Morphological analysis
• Question answering
• Text summarization
• Emotion detection
• Anaphora resolution
• Author identification
A Review of the Recent History of NLP
• 2001: Neural Language Model
• 2013: Word embeddings
• 2014: Sequence-to-sequence models
• 2015: Attention
• 2017: Transformer
• 2018: Pretrained language models
• 2019: BERT
• 2020: GPT-3 (Generative Pre-trained Transformer 3)
• Neural Language Model
  • One-hot encoding: curse of dimensionality
  • Solution: distributed representation of words
  • Vector representations
Source: scholarpedia.org
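The problem with one-hot encoding can be seen in a small sketch. Each word becomes a vector as long as the vocabulary, and any two different words have a dot product of zero, so the encoding carries no notion of similarity. The vocabulary here is built from a single example sentence, which is an assumption made for illustration.

```python
sentence = "the quick brown fox jumps over the lazy dog"
vocab = sorted(set(sentence.split()))

def one_hot(word):
    # One dimension per vocabulary entry, with a single 1 at the word's position.
    return [1 if entry == word else 0 for entry in vocab]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

fox, dog = one_hot("fox"), one_hot("dog")
print(len(vocab), fox)
print(dot(fox, dog), dot(fox, fox))
```

With a realistic vocabulary of tens of thousands of words, every vector would have that many dimensions, which is the dimensionality problem that distributed representations address.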
• Word Embeddings
  • Word representation that allows words with similar meaning to have a similar representation.
  • Word2Vec
Source: https://arxiv.org/pdf/1301.3781.pdf
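"Similar representation" is usually measured with cosine similarity between word vectors. In the sketch below the three-dimensional vectors are invented by hand purely for illustration. They are not Word2Vec output, and real embeddings typically have hundreds of dimensions learned from large corpora.

```python
import math

# Made-up vectors for illustration only; not taken from any trained model.
EMBEDDINGS = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

cat_dog = cosine(EMBEDDINGS["cat"], EMBEDDINGS["dog"])
cat_car = cosine(EMBEDDINGS["cat"], EMBEDDINGS["car"])
print(round(cat_dog, 3), round(cat_car, 3))
```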
• Sequence-to-sequence models
  • Natural language generation
Source: towardsdatascience.com
• Transformer
Source: towardsdatascience.com
Pre-trained language models
Image source: https://analyticsindiamag.com/top-8-pre-trained-nlp-models-developers-must-know/
• BERT: Bidirectional Encoder Representations from Transformers
  • Pre-training: using unlabeled data
  • Fine-tuning: using labeled data
Source: https://arxiv.org/pdf/1810.04805.pdf
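Pre-training can use unlabeled data because the text supplies its own labels. In BERT's masked language modelling objective, some tokens are hidden and the model is trained to predict them. The sketch below only builds such a training example. The positions to mask are fixed by hand here (BERT samples them at random), and the function name mask_tokens is hypothetical.

```python
def mask_tokens(tokens, positions, mask_token="[MASK]"):
    # Replace the chosen positions with the mask token and remember the
    # original words, which become the prediction targets.
    masked = list(tokens)
    targets = {}
    for i in positions:
        targets[i] = masked[i]
        masked[i] = mask_token
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens, positions=[3, 7])
print(" ".join(masked))
print(targets)
```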
• GPT-3
  • Generative Pre-Training (GPT)
  • Unsupervised pre-training
  • Supervised fine-tuning
Sources: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf, https://arxiv.org/pdf/2005.14165.pdf
Open Source NLP Tools
https://medium.com/microsoftazure/7-amazing-open-source-nlp-tools-to-try-with-notebooks-in-2019-c9eec058d9f1
Toolboxes for NLP in Indian Languages
• Natural Language Toolkit for Indic Languages (iNLTK)
  • https://inltk.readthedocs.io/en/latest/
• NLP for Indic languages
  • https://indicnlp.org/#vision
• Samsaadhanii
  • https://scl.samsaadhanii.in/scl/#
Datasets for NLP in Indian languages
• Indian language dataset: https://github.com/goru001
• Universal Dependencies: https://github.com/UniversalDependencies
• IIT Hyderabad
• Google Dataset Search: https://datasetsearch.research.google.com/
Recent Research Areas
• Neural Machine Translation
• Text summarization
• Multimodal sentiment analysis
• Multimodal information extraction
• Question answering
• Dialogue and interactive systems
• Speech-to-speech translation
• Fake news identification
• Offensive language identification
• Few-shot learning
• Unsupervised text mining
Thank You

Introduction to Natural Language Processing - Stages in NLP Pipeline, Challenges in NLP, Ambiguities in NLP, Language Models, Tools, Frameworks and Datasets