What is natural language processing (NLP)?

Natural language processing (NLP) is the ability of a computer program to understand human language as it's spoken and written -- referred to as natural language. It's a component of artificial intelligence (AI).

NLP has existed for more than 70 years and has roots in the field of linguistics. It has a variety of real-world applications in numerous fields, including medical research, search engines and business intelligence.

NLP uses either rule-based or machine learning approaches to understand the structure and meaning of text. It plays a role in chatbots, voice assistants, text-based scanning programs, translation applications and enterprise software that aids in business operations, increases productivity and simplifies different processes.

How does natural language processing work?

NLP uses many different techniques to enable computers to understand natural language as humans do. Whether the language is spoken or written, natural language processing can use AI to take real-world input, process it and make sense of it in a way a computer can understand. Just as humans have different sensors -- such as ears to hear and eyes to see -- computers have programs to read and microphones to collect audio. And just as humans have a brain to process that input, computers have a program to process their respective inputs. At some point in processing, the input is converted to code that the computer can understand.

There are two main phases to natural language processing: data preprocessing and algorithm development.

Data preprocessing involves preparing and cleaning text data so that machines can analyze it. Preprocessing puts data in a workable form and highlights features in the text that an algorithm can work with. Common techniques include tokenization (splitting text into words or other units), stop word removal, lemmatization and stemming, and part-of-speech tagging. A short sketch of a few of these steps follows.
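As a concrete illustration, here is a minimal preprocessing sketch using NLTK, one of the open source tools discussed later in this article. The sample sentence is made up, and the snippet assumes the NLTK punkt and stopwords resources can be downloaded (resource names can vary slightly across NLTK versions).

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# One-time downloads of the tokenizer model and stop word list.
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

text = "The dogs barked loudly at the mail carriers."

# Tokenization: split the raw string into individual word tokens.
tokens = word_tokenize(text.lower())

# Stop word removal: drop common words that carry little meaning.
stop_words = set(stopwords.words("english"))
content = [t for t in tokens if t.isalpha() and t not in stop_words]

# Stemming: reduce inflected words to a root form.
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in content])
# ['dog', 'bark', 'loudli', 'mail', 'carrier']
```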

Once the data has been preprocessed, an algorithm is developed to process it. There are many different natural language processing algorithms, but two main types are commonly used: rule-based systems, which apply handcrafted linguistic rules, and machine learning-based systems, which learn patterns from training data.

Why is natural language processing important?

Businesses use large amounts of unstructured, text-heavy data and need a way to efficiently process it. Much of the information created online and stored in databases is natural human language, and until recently, businesses couldn't effectively analyze this data. This is where natural language processing is useful.

The advantages of natural language processing can be seen when considering the following two statements: "Cloud computing insurance should be part of every service-level agreement" and "A good SLA ensures an easier night's sleep -- even in the cloud." If a user relies on natural language processing for search, the program will recognize that cloud computing is an entity, that cloud is an abbreviated form of cloud computing, and that SLA is an industry acronym for service-level agreement.

These are the types of vague elements that frequently appear in human language and that machine learning algorithms have historically been bad at interpreting. Now, with improvements in deep learning and machine learning methods, algorithms can effectively interpret them. These improvements expand the breadth and depth of data that can be analyzed.

NLP provides similar benefits when a person interacts with a generative AI chatbot or an AI voice assistant. Instead of needing to use specific predefined language, a user can speak to a voice assistant such as Siri in their regular diction, and the assistant will still understand them.

[Figure: A chart showing some of the key areas in which a business can use natural language processing.]

Techniques and methods of natural language processing

Syntax and semantic analysis are two main techniques used in natural language processing.

Syntax is the arrangement of words in a sentence to make grammatical sense. NLP uses syntax to assess meaning from a language based on grammatical rules. Syntax NLP techniques include the following:

Parsing

This is the grammatical analysis of a sentence. For example, a natural language processing algorithm is fed the sentence, "The dog barked." Parsing involves breaking this sentence into parts of speech -- i.e., dog = noun, barked = verb. This is useful for more complex downstream processing tasks.
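For instance, a minimal part-of-speech tagging sketch with NLTK might look like the following; it assumes the standard English tagger model is available (its download name varies a little between NLTK versions).

```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# Tag each token in the sentence with its part of speech.
tokens = nltk.word_tokenize("The dog barked.")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('dog', 'NN'), ('barked', 'VBD'), ('.', '.')]
```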

Word segmentation

This is the act of taking a string of text and deriving word forms from it. For example, a person scans a handwritten document into a computer. The algorithm can analyze the page and recognize that the words are divided by white spaces.

Sentence breaking

This places sentence boundaries in large texts. For example, a natural language processing algorithm is fed the text, "The dog barked. I woke up." The algorithm can use sentence breaking to recognize the period that splits up the sentences.
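A brief sketch of both sentence breaking and word segmentation with NLTK, using the article's example text:

```python
import nltk

nltk.download("punkt", quiet=True)

text = "The dog barked. I woke up."

# Sentence breaking: place sentence boundaries in the text.
print(nltk.sent_tokenize(text))
# ['The dog barked.', 'I woke up.']

# Word segmentation: derive individual word tokens.
print(nltk.word_tokenize(text))
# ['The', 'dog', 'barked', '.', 'I', 'woke', 'up', '.']
```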

Morphological segmentation

This divides words into smaller parts called morphemes. For example, the word untestably would be broken into [[un[[test]able]]ly], where the algorithm recognizes "un," "test," "able" and "ly" as morphemes. This is especially useful in machine translation and speech recognition.

Stemming

This divides words with inflection in them into root forms. For example, in the sentence, "The dog barked," the algorithm would recognize the root of the word "barked" is "bark." This is useful if a user is analyzing text for all instances of the word bark, as well as all its conjugations. The algorithm can see that they're essentially the same word even though the letters are different.
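A quick sketch of this idea using NLTK's Porter stemmer, which reduces all inflections of "bark" to the same root:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Every inflected form maps to the same root, "bark".
print([stemmer.stem(w) for w in ["bark", "barked", "barking", "barks"]])
# ['bark', 'bark', 'bark', 'bark']
```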

Semantics involves the use of and meaning behind words. Natural language processing applies algorithms to understand the meaning and structure of sentences. Semantic techniques include the following:

Word sense disambiguation

This derives the meaning of a word based on context. For example, consider the sentence, "The pig is in the pen." The word pen has different meanings. An algorithm using this method can understand that the use of the word here refers to a fenced-in area, not a writing instrument.
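One classic baseline for this task is the Lesk algorithm, available in NLTK; a minimal sketch follows. Lesk picks the WordNet sense whose dictionary gloss overlaps most with the surrounding context words, so it won't always select the intended sense.

```python
import nltk
from nltk.wsd import lesk

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

# Disambiguate "pen" using the other words in the sentence as context.
context = nltk.word_tokenize("The pig is in the pen")
sense = lesk(context, "pen")
if sense is not None:
    print(sense.name(), "-", sense.definition())
```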

Named entity recognition (NER)

NER determines words that can be categorized into groups. For example, an algorithm using this method could analyze a news article and identify all mentions of a certain company or product. Using the semantics of the text, it could differentiate between entities that are visually the same. For instance, in the sentence, "Daniel McDonald's son went to McDonald's and ordered a Happy Meal," the algorithm could recognize the two instances of "McDonald's" as two separate entities -- one a restaurant and one a person.
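A minimal NER sketch with NLTK's bundled chunker is shown below. Note that this chunker is a simple baseline and may not untangle the trickier McDonald's example above; production systems typically use more capable statistical or neural models.

```python
import nltk

# Download the tokenizer, tagger and named entity chunker resources.
for pkg in ("punkt", "averaged_perceptron_tagger",
            "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)

sentence = "Daniel McDonald went to McDonald's in Chicago."
# ne_chunk labels proper-noun phrases with entity types
# such as PERSON, ORGANIZATION and GPE.
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
print(tree)
```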

Natural language generation (NLG)

NLG uses a database to determine the semantics behind words and generate new text. For example, an algorithm could automatically write a summary of findings from a business intelligence (BI) platform, mapping certain words and phrases to features of the data in the BI platform. Another example would be automatically generating news articles or tweets based on a certain body of text used for training.
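In its simplest form, this kind of data-to-text mapping can be done with templates; modern systems use neural language models instead. Here is a minimal template-based sketch in which the record fields are entirely hypothetical:

```python
# Hypothetical structured record, standing in for a row from a BI platform.
record = {"metric": "revenue", "region": "EMEA", "change": 0.12}

# Map data features to words and phrases via a simple template.
direction = "rose" if record["change"] >= 0 else "fell"
summary = (f"{record['metric'].capitalize()} in {record['region']} "
           f"{direction} by {abs(record['change']):.0%}.")
print(summary)  # Revenue in EMEA rose by 12%.
```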

Current approaches to natural language processing are based on deep learning, a type of AI that examines and uses patterns in data to improve a program's understanding. Deep learning models require massive amounts of labeled data for the natural language processing algorithm to train on and identify relevant correlations, and assembling this kind of big data set is one of the main hurdles to natural language processing.

Earlier approaches to natural language processing involved a more rule-based approach, where simpler machine learning algorithms were told what words and phrases to look for in text and given specific responses when those phrases appeared. But deep learning is a more flexible, intuitive approach in which algorithms learn to identify speakers' intent from many examples -- almost like how a child would learn human language.

Three open source tools commonly used for natural language processing are the Natural Language Toolkit (NLTK), Gensim and NLP Architect by Intel. NLTK is a Python library with data sets and tutorials. Gensim is a Python library for topic modeling and document indexing. NLP Architect by Intel is a Python library for deep learning topologies and techniques.
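As an illustration of what these libraries do, here is a minimal Gensim topic modeling sketch. The toy corpus is hypothetical and far too small for meaningful topics; it only shows the mechanics.

```python
from gensim import corpora, models

# A toy corpus of tokenized documents (contents are hypothetical).
texts = [
    ["dog", "barked", "loud", "dog"],
    ["cloud", "computing", "sla", "service"],
    ["dog", "bark", "pen", "pig"],
    ["cloud", "service", "agreement", "sla"],
]

# Map each unique token to an integer ID, then convert each
# document to a bag-of-words vector.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Fit a two-topic LDA model and print the top words per topic.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      random_state=0, passes=10)
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```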

Natural language processing algorithms perform a range of functions and tasks, including sentiment analysis, text classification, summarization and machine translation.

These functions are used in a variety of real-world applications, including chatbots, voice assistants, search engines, translation software and business intelligence tools.

Benefits of natural language processing

The main benefit of NLP is that it improves the way humans and computers communicate with each other. The most direct way to manipulate a computer is through code -- the computer's language. Enabling computers to understand human language makes interacting with computers much more intuitive for humans.

Other benefits include the ability to efficiently analyze large volumes of unstructured text, automate repetitive language tasks such as document processing and translation, and improve productivity across business operations.

Challenges of natural language processing

There are numerous challenges in natural language processing, and most of them boil down to the fact that natural language is ever-evolving and somewhat ambiguous. They include words with multiple meanings, variations in tone and inflection such as irony and sarcasm, and the constant emergence of new slang and usage.

The evolution of natural language processing

NLP draws from a variety of disciplines, including computer science and computational linguistics developments dating back to the mid-20th century. Its evolution included the following major milestones:

1950s

Natural language processing has its roots in this decade, when Alan Turing developed the Turing Test to determine whether a computer is truly intelligent. The test uses the automated interpretation and generation of natural language as a criterion of intelligence.

1950s-1990s

NLP was largely rule-based, using handcrafted rules developed by linguists to determine how computers would process language. The Georgetown-IBM experiment in 1954 was a notable demonstration of machine translation, automatically translating more than 60 sentences from Russian to English. The 1980s and 1990s saw the development of rule-based parsing, morphology, semantics and other forms of natural language understanding.

1990s

The top-down, language-first approach to natural language processing was replaced with a more statistical approach because advancements in computing made this a more efficient way of developing NLP technology. Computers were becoming faster and could be used to derive rules from linguistic statistics without a linguist creating all the rules by hand. Data-driven natural language processing became mainstream during this decade, shifting from a linguist-driven approach to an engineering-driven one that drew on a wider variety of scientific disciplines.

2000-2020s

Natural language processing saw dramatic growth in popularity as a term, and researchers explored unsupervised and semi-supervised machine learning approaches to NLP. With advances in computing power, natural language processing gained numerous real-world applications and began powering chatbots and virtual assistants. Today, approaches to NLP combine classical linguistics with statistical and deep learning methods.

Natural language processing plays a vital part in technology and the way humans interact with it. Though it has its challenges, NLP is expected to become more accurate with more sophisticated models, more accessible and more relevant in numerous industries. NLP will continue to be an important part of both industry and everyday life.

As natural language processing is making significant strides in new fields, it's becoming more important for developers to learn how it works. Learn how to develop your skills in creating NLP programs.
