Merge pull request #772 from shashank3959/notebook-update

Update DLE READMEs
This commit is contained in:
nv-kkudrynski 2021-01-08 23:48:27 +01:00 committed by GitHub
commit 3f9cdc6290
No known key found for this signature in database
GPG key ID: 4AEE18F83AFDEB23
52 changed files with 1146 additions and 0 deletions


@ -0,0 +1,56 @@
# Image Classification
Image classification is the task of categorizing an image into one of several predefined classes, often also giving a probability that the input belongs to a particular class. This task is crucial in understanding and analyzing images, and it comes quite effortlessly to human beings with our complex visual systems. Most powerful image classification models today are built using some form of Convolutional Neural Network (CNN), which is also the backbone of many other tasks in Computer Vision.
![What is Image Classification?](img/1_image-classification-figure-1.PNG)
[Source](https://github.com/NVlabs/stylegan)
In this overview, we will cover
- Types of image classification
- How does it work?
- How is the performance evaluated?
- Use cases and applications
- Where to get started
---
## Types of image classification
Image classification can be broadly divided into binary or multi-class problems depending on the number of categories. Binary image classification problems entail predicting one of two classes. An example of this would be predicting whether an image is that of a dog or not. A subtly different problem is that of single-class (one-vs-all) classification, where the goal is to recognize data from one class and reject all others. This is beneficial when there is an overabundance of data from one of the classes, also called a class imbalance.
![Input and Outputs for Image Classification](img/1_image-classification-figure-2.PNG)
In Multi-class classification problems, models categorize instances into one of three or more categories. Multi-class models often also return confidence scores (or probabilities) of an image belonging to each of the possible classes. This should not be confused with multi-label classification, where a model assigns multiple labels to an instance.
---
## How does it work?
In recent years, Convolutional Neural Networks (CNNs) have led the way to massive breakthroughs in Computer Vision. Most state-of-the-art Image Classification models today employ CNNs in some form. Convolutional Layers are the building blocks of CNNs, and similar to Neural Networks they are composed of neurons that learn parameters like weights and biases. Most CNNs are composed of many Convolutional layers that work like feature extractors, and coupled with Fully Connected (FC) layers they learn to identify patterns in images to return confidence scores in different categories.
But what makes Convolutional Networks special? Well, CNNs are built with the assumption that input is in the form of images, and exploiting this fact they can be vastly more efficient than a standard Neural Network for a given level of performance.
![Typical CNN architecture](img/1_image-classification-figure-3.PNG)
Network depth (number of layers) and the number of learnable parameters have been found to be of crucial importance in performance. Top models can typically have over a hundred layers and hundreds of millions of parameters. Much of recent research in visual recognition has been focused around “network engineering”, i.e. designing better architectures, even employing Machine Learning algorithms to search for one, such as in the case of Neural Architecture Search.
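To make this concrete, here is a minimal PyTorch sketch of the pattern described above: a stack of convolutional feature-extraction layers followed by a fully connected head that returns per-class scores. The layer sizes, the 32x32 input, and the ten-class output are arbitrary choices for illustration, not taken from any model in this Collection.
```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """A deliberately small CNN: conv layers extract features, an FC head scores classes."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                               # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                               # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, start_dim=1)
        return self.classifier(x)                          # raw class scores (logits)

logits = TinyCNN()(torch.randn(1, 3, 32, 32))              # scores for 10 classes
probs = logits.softmax(dim=1)                              # per-class probabilities
```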
---
## How is the performance evaluated?
Image Classification performance is often reported as Top-1 or Top-5 scores. In top-1 score, classification is considered correct if the top predicted class (with the highest predicted probability) matches the true class for a given instance. In top-5, we check if one of the top 5 predictions matches the true class. The score is just the number of correct predictions divided by the total number of instances evaluated.
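The top-k metric is easy to compute directly from a model's output scores. Here is a small PyTorch sketch; the random scores and the 1000-class setup are placeholders for illustration.
```python
import torch

def topk_accuracy(logits, targets, k=5):
    """Fraction of samples whose true class is among the k highest-scoring predictions."""
    topk = logits.topk(k, dim=1).indices                   # (N, k) predicted class indices
    correct = (topk == targets.unsqueeze(1)).any(dim=1)
    return correct.float().mean().item()

logits = torch.randn(8, 1000)                              # fake scores for 8 images, 1000 classes
targets = torch.randint(0, 1000, (8,))
print(topk_accuracy(logits, targets, k=1), topk_accuracy(logits, targets, k=5))
```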
---
## Use cases and applications
### Categorizing Images in Large Visual Databases
Businesses with visual databases may accumulate large amounts of images with missing tags or metadata. Unless there is an effective way to organize such images, they may not be of much use at all; worse, they may hog precious storage space. Automated image classification algorithms can classify such untagged images into predefined categories, allowing businesses to avoid expensive manual labor.
A related task is that of Image Organization in smart devices like mobile phones. With Image Classification techniques, images and videos can be organized for improved accessibility.
### Visual Search
Visual Search or Image-based search has risen to popularity over the recent years. Many prominent search engines already provide this feature where users can search for visual content similar to a provided image. This has many applications in the e-commerce and retail industry where users can take a snap and upload an image of a product they are interested in purchasing. This makes the shopping experience much more efficient for customers, and can increase sales for businesses.
### Healthcare
Medical imaging is about creating visual images of internal body parts for clinical purposes. This includes health monitoring, medical diagnosis, treatment, and keeping organized records. Image classification algorithms can play a crucial role in medical imaging by helping medical professionals detect the presence of illness and by improving consistency in clinical diagnosis.
---
## Where to get started?
In this Collection, you will find state-of-the-art implementations of Image Classification models and their containers. A good place to get started with Image Classification is with the [ResNet-50](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Classification/ConvNets/resnet50v1.5) model.
ResNets (Residual Networks) are very popular Convolutional Neural Network architectures built with blocks that use skip connections to jump over some layers. As the name suggests, ResNet-50 is a variant that is 50 layers deep! But why do we need these “skip” connections? As it turns out, building better CNN architectures is not as simple as stacking more and more layers. In practice, if we just keep adding depth to a CNN, at some point the performance stagnates or may even start getting worse. Very deep networks are notoriously difficult to train because of the vanishing gradient problem: as the depth increases, repeated multiplications during back-propagation can make the gradient vanishingly small, which may prevent the weights from changing. In ResNets, the skip connections act like a “gradient superhighway”, allowing the gradient to flow unrestrained and alleviating the vanishing gradient problem. ResNets were very influential in the development of subsequent Convolutional Network architectures, and there is much more to them than the brief summary above!
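Below is a minimal PyTorch sketch of a basic residual block in the spirit of the description above; the real ResNet-50 uses bottleneck blocks (1x1/3x3/1x1 convolutions) and projection shortcuts, so this is an illustrative simplification, not the actual architecture.
```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simplified residual block: the input 'skips' over two conv layers and is added back."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                            # the skip connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity                    # gradients can flow straight through this add
        return self.relu(out)

y = ResidualBlock(64)(torch.randn(1, 64, 56, 56))
```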

Binary file not shown (294 KiB)
Binary file not shown (112 KiB)
Binary file not shown (105 KiB)


@ -0,0 +1,51 @@
# Object Detection
A natural progression from image classification is classification and localization of the subject of the image. We can take this idea one step further and localize all the objects in a given image. Simply put, object detection refers to identifying which objects are present in an image and where they are located.
![](img/2_object-detection-figure-1.png)
Source: [Joseph Redmon, Ali Farhadi, “YOLO9000:Better, Faster, Stronger”](https://arxiv.org/abs/1612.08242)
## Introduction to Object Detection
In this section we will try to answer the following questions:
- What is object detection?
- Why is object detection important?
Object Detection is about not only detecting the presence and location of objects in images and videos, but also categorizing them into everyday objects. Oftentimes, there is a confusion between Image Classification and Object Detection. Simply put, the difference between them is the same as the difference between saying “This is a cat” and pointing to a cat and saying “There is the cat”.
To build autonomous systems, perception is the main challenge to be solved. Perception, in the context of autonomous systems, refers to the ability to understand the surroundings of the autonomous agent. This means that the agent needs to be able to figure out where and what objects are in its immediate vicinity.
Object detection can help keep humans away from toxic environments and hazardous situations. Challenges like garbage segregation, oil rig monitoring, nightly surveillance, cargo port maintenance and other high-risk applications can be aided by robots or cameras that can detect objects. Essentially, in any environment that requires visual inspection or analysis and is too dangerous for humans, object detection pipelines can be used to shield people from onsite hazards.
## How does it work?
While this has been a topic of research since before Deep Learning became mainstream, the best performing models today use one or more Deep Neural Networks.
Many architectures have networks pretrained on a different, simpler task, like Image Classification. As one can imagine, the inputs to this task can be images or videos, and the outputs are usually a set of bounding box coordinates that enclose each of the detected objects, as well as a class label for each detected object. With advances in research and the use of GPUs, it is possible to have object detection in real time with really impressive accuracies!
![](img/2_object-detection-figure-2.png)
Source: [Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg, “SSD: Single Shot MultiBox Detector”](https://arxiv.org/abs/1512.02325)
Single Shot Detector (SSD) is one of the state-of-the-art models for object detection and localization. It is based on a feed-forward convolutional neural network which always yields a fixed-size set of bounding boxes, along with a confidence score for each box representing how confident the network is that the box contains an object. This is followed by a non-maximum suppression step which outputs the final detections.
This network can be understood as two networks stacked on top of each other. The first network is a simple convolutional neural network which “extracts important features”, just like an image classification network.
The second network is a multiscale feature map network built using another set of convolutional layers which are progressively smaller in size to allow detections at multiple scales. Simply put, the progressively smaller layers help detect objects of different sizes. Each layer in this set outputs a number of detections, and the final layer passes its output to a non-maximum suppression step which yields the final set of detections.
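To make the last step concrete, here is a minimal greedy non-maximum suppression sketch in PyTorch. The (x1, y1, x2, y2) box format and the 0.5 IoU threshold are illustrative assumptions; production code would normally call a library routine such as torchvision.ops.nms instead.
```python
import torch

def iou(box, boxes):
    """IoU between one box and many boxes; boxes are (x1, y1, x2, y2)."""
    x1 = torch.maximum(box[0], boxes[:, 0]); y1 = torch.maximum(box[1], boxes[:, 1])
    x2 = torch.minimum(box[2], boxes[:, 2]); y2 = torch.minimum(box[3], boxes[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes that overlap it too much, repeat."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        best = order[0]
        keep.append(best.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep

boxes = torch.tensor([[0., 0., 10., 10.], [1., 1., 11., 11.], [50., 50., 60., 60.]])
scores = torch.tensor([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # [0, 2]: the second box overlaps the first and is suppressed
```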
This Collection contains models and containers for object detection achieving state-of-the-art accuracies, tested and maintained by NVIDIA.
## Applications and Use cases
### Autonomous Vehicles
Autonomous vehicles need to perceive and interact with real-world objects in order to blend in with the environment. For instance, a self-driving car needs to detect other vehicles, pedestrians, objects on the road, traffic signals, and any and all obstacles on the road, and also understand the exact location of these objects. This perception information helps the agent avoid obstacles and understand how to interact with objects like traffic lights.
### Warehouses
Warehouses have many conveyor belts and segregation platforms. These tasks have traditionally been handled manually. As factories and warehouses scale, manually sorting and managing inventory cannot be scaled proportionally. Object detection pipelines deployed on robots can reduce operational friction and enable easy scale up solutions for businesses.
### Surveillance
Surveillance systems typically accumulate large volumes of video data which needs to be analyzed for all sorts of anomalies. Given the number of video sources even a small store has, analysing surveillance data from a large operation is a challenge. Object detection networks can help automate much of the pipeline to highlight sections where there is an object of interest. It can also be trained to identify anomalies in video streams.
### Hazardous tasks
Humans work at waste processing plants, nuclear power plants, oil rigs, and around heavy machinery, all of which tend to be extremely hazardous and pose health risks. These tasks essentially require a human presence for visual checks and confirmations that revolve around recognizing objects and relaying their locations. Risky tasks like these can be completed with the help of an object detection pipeline deployed on a camera or a robot, which can reduce operational risks and costs.

Binary file not shown (593 KiB)
Binary file not shown (66 KiB)


@ -0,0 +1,93 @@
# Language Modeling
Language modeling (LM) is a natural language processing (NLP) task that determines the probability of a given sequence of words occurring in a sentence.
In an era where computers, smartphones and other electronic devices increasingly need to interact with humans, language modeling has become an indispensable technique for teaching devices how to communicate in natural languages in human-like ways.
But how does language modeling work? And what can you build with it? What are the different approaches, what are its potential benefits and limitations, and how might you use it in your business?
In this guide, you'll find answers to all of those questions and more. Whether you're an experienced machine learning engineer considering implementation, a developer wanting to learn more, or a product manager looking to explore what's possible with natural language processing and language modeling, this guide is for you.
Here's a look at what we'll cover:
- Language modeling: the basics
- How does language modeling work?
- Use cases and applications
- Getting started
## Language modeling: the basics
### What is language modeling?
"*Language modeling is the task of assigning a probability to sentences in a language. […]
Besides assigning a probability to each sequence of words, the language models also assign a
probability for the likelihood of a given word (or a sequence of words) to follow a sequence
of words.*" Source: Page 105, [Neural Network Methods in Natural Language Processing](http://amzn.to/2wt1nzv), 2017.
### Types of language models
There are primarily two types of Language Models:
- Statistical Language Models: These models use traditional statistical techniques like N-grams, Hidden Markov Models (HMM), and certain linguistic rules to learn the probability distribution of words.
- Neural Language Models: They use different kinds of Neural Networks to model language, and have surpassed the statistical language models in their effectiveness.
"*We provide ample empirical evidence to suggest that connectionist language models are
superior to standard n-gram techniques, except their high computational (training)
complexity.*" Source: [Recurrent neural network based language model](http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf), 2010.
Given the superior performance of neural language models, we include in the container two popular state-of-the-art neural language models: BERT and Transformer-XL.
### Why is language modeling important?
Language modeling is fundamental in modern NLP applications. It enables machines to understand qualitative information, and enables people to communicate with machines in the natural languages that humans use to communicate with each other.
Language modeling is used directly in a variety of industries, including tech, finance, healthcare, transportation, legal, military, government, and more -- actually, you probably have just interacted with a language model today, whether it be through Google search, engaging with a voice assistant, or using text autocomplete features.
## How does language modeling work?
The roots of modern language modeling can be traced back to 1948, when Claude Shannon
published a paper titled "A Mathematical Theory of Communication", laying the foundation for information theory and language modeling. In the paper, Shannon detailed the use of a stochastic model called the Markov chain to create a statistical model for the sequences of letters in English text. The Markov models, along with n-gram, are still among the most popular statistical language models today.
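As a toy illustration of the statistical approach, the sketch below estimates bigram probabilities from a tiny corpus by simple counting. The corpus and the unsmoothed maximum-likelihood estimate are deliberate simplifications; real n-gram models use large corpora and smoothing.
```python
from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count bigrams and the contexts (previous words) they condition on.
bigrams = Counter(zip(corpus, corpus[1:]))
context = Counter(corpus[:-1])

def p_next(word, prev):
    """Unsmoothed maximum-likelihood estimate of P(word | prev)."""
    return bigrams[(prev, word)] / context[prev] if context[prev] else 0.0

print(p_next("sat", "cat"))   # 1.0 in this tiny corpus
print(p_next("mat", "the"))   # 0.25: 'the' is followed by cat/mat/dog/rug
```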
However, simple statistical language models have serious drawbacks in scalability and fluency because of their sparse representation of language. Neural language models overcome this problem by representing language units (e.g. words, characters) as non-linear, distributed combinations of weights in a continuous space, which lets them learn to approximate words without being misled by rare or unknown values.
Therefore, as mentioned above, we introduce two popular state-of-the-art neural language models, BERT and Transformer-XL, in TensorFlow and PyTorch. More details can be found in the [NVIDIA Deep Learning Examples GitHub repository](https://github.com/NVIDIA/DeepLearningExamples).
## Use cases and applications
### Speech Recognition
Imagine speaking a phrase to the phone, expecting it to convert the speech to text. How does
it know if you said "recognize speech" or "wreck a nice beach"? Language models help figure it out
based on the context, enabling machines to process and make sense of speech audio.
### Spelling Correction
Language-models-enabled spellcheckers can point to spelling errors and possibly suggest alternatives.
### Machine translation
Imagine you are translating the Chinese sentence "我在开车" into English. Your translation system gives you several choices:
- I at open car
- me at open car
- I at drive
- me at drive
- I am driving
- me am driving
A language model tells you which translation sounds the most natural.
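As a rough sketch of how that ranking could be done, the snippet below scores each candidate with a pretrained causal language model and keeps the one with the lowest average per-token negative log-likelihood. It assumes the Hugging Face transformers package and the public gpt2 checkpoint, neither of which is part of this repository.
```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def neg_log_likelihood(sentence):
    """Average per-token NLL under the model; lower means more natural-sounding text."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()   # labels are shifted internally

candidates = ["I at open car", "me at drive", "I am driving"]
print(min(candidates, key=neg_log_likelihood))      # expect the fluent "I am driving"
```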
## Getting started
NVIDIA provides examples for language modeling in the [Deep Learning Examples GitHub repository](https://github.com/NVIDIA/DeepLearningExamples). These examples provide you with easy-to-consume and highly optimized scripts for both training and inference. The quick start guides in the repository will help you set up the environment using NGC Docker images, download pre-trained models from NGC, and adapt the model training and inference for your application/use case.
These models are tested and maintained by NVIDIA, leveraging mixed precision using tensor cores on our latest GPUs for faster training times while maintaining accuracy.


@ -0,0 +1,65 @@
# Recommender Systems
Recommender systems are a type of information filtering system that seeks to predict the
"rating" or "preference" a user would give to an item. (Source:
[Wikipedia](https://en.wikipedia.org/wiki/Recommender_system))
In an era where users have to navigate through an exponentially growing number of goods and services, recommender systems have become key in driving user engagement, teaching the internet services how to personalize experiences for users. They are ubiquitous and indispensable in commercial online platforms.
In this guide, you'll find answers to how recommender systems work, how you might use them in your business, and more. Whether you're an experienced machine learning engineer considering implementation, a developer wanting to learn more, or a product manager looking to explore what's possible with recommender systems, this guide is for you.
Here is a look at what we will cover:
- Challenges and opportunities in recommender systems
- How do DL-based recommender systems work?
- Use cases and applications
## Challenges and opportunities in recommender systems
With the rapid growth in scale of industry datasets, deep learning (DL) recommender models have started to gain advantages over traditional methods by capitalizing on large amounts of training data. However, there are multiple challenges when it comes to performance of large-scale recommender systems solutions:
- Huge datasets: Commercial recommenders are trained on huge datasets, often several terabytes in scale.
- Complex data preprocessing and feature engineering pipelines: Datasets need to be preprocessed and transformed into a form relevant to be used with DL models and frameworks. In addition, feature engineering creates an extensive set of new features from existing ones, requiring multiple iterations to arrive at an optimal solution.
- Input bottleneck: Data loading, if not well optimized, can be the slowest part of the training process, leading to under-utilization of high-throughput computing devices such as GPUs.
- Extensive repeated experimentation: The whole data engineering, training, and evaluation process is generally repeated many times, requiring significant time and computational resources.
To meet the computational demands for large-scale DL recommender systems training and inference, recommender-on-GPU solutions aim to provide fast feature engineering and high training throughput (to enable both fast experimentation and production retraining), as well as low latency, high-throughput inference.
Current DL-based models for recommender systems include the [Wide and
Deep](https://arxiv.org/abs/1606.07792) model, Deep Learning Recommendation Model
([DLRM](https://github.com/facebookresearch/dlrm)), neural collaborative filtering
([NCF](https://arxiv.org/abs/1708.05031)), Variational Autoencoder
([VAE](https://arxiv.org/abs/1802.05814)) for Collaborative Filtering, and
[BERT4Rec](https://arxiv.org/pdf/1904.06690.pdf), among others.
## How do DL-based recommender systems work?
In [NVIDIA Deep Learning Examples](https://github.com/NVIDIA/DeepLearningExamples), we introduce several popular state-of-the-art DL-based recommender models in TensorFlow and PyTorch.
As an example, we would like to start by discussing our reference implementation of DLRM. With DLRM, we systematically tackle the challenges mentioned above by designing a complete DLRM pipeline, from data preparation to training to production inference. We provide ready-to-go Docker images for training and inference, data downloading and preprocessing tools, and Jupyter demo notebooks to get you started quickly. Also, trained models can be prepared for production inference in one simple step with our exporter tool.
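For intuition, here is a minimal sketch of the embedding-plus-MLP pattern that models such as DLRM and NCF build on: sparse categorical features are mapped to dense embeddings, combined with dense features, and scored by an MLP. The feature sizes and the simple concatenation-based interaction are illustrative assumptions, not the actual DLRM architecture.
```python
import torch
import torch.nn as nn

class TinyRecommender(nn.Module):
    """Embed sparse categorical features, concatenate with dense features, score with an MLP."""
    def __init__(self, num_users, num_items, emb_dim=16, num_dense=4):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, emb_dim)
        self.item_emb = nn.Embedding(num_items, emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * emb_dim + num_dense, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, user_ids, item_ids, dense):
        x = torch.cat([self.user_emb(user_ids), self.item_emb(item_ids), dense], dim=1)
        return torch.sigmoid(self.mlp(x)).squeeze(1)   # predicted interaction probability

model = TinyRecommender(num_users=1000, num_items=5000)
scores = model(torch.tensor([1, 2]), torch.tensor([10, 42]), torch.randn(2, 4))
```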
For more details on the model architectures, example code, and how to set up the end-to-end data processing, training, and inference pipeline on GPU, please refer to the [DLRM developer blog](https://developer.nvidia.com/blog/optimizing-dlrm-on-nvidia-gpus/) and the [NVIDIA GPU-accelerated DL model portfolio](https://github.com/NVIDIA/DeepLearningExamples) under /PyTorch/Recommendation/DLRM.
In addition, DLRM forms part of NVIDIA [Merlin](https://developer.nvidia.com/nvidia-merlin), a framework for building high-performance, DL-based recommender systems.
## Use cases and applications
### E-Commerce & Retail: Personalized Merchandising
Imagine a user who has already purchased a scarf. Why not offer a hat that matches this scarf, so that the look will be complete? This feature is often implemented by means of AI-based algorithms as “Complete the look” or “You might also like” sections in e-commerce platforms like Amazon, Walmart, Target, and many others.
On average, an intelligent recommender system delivers a [22.66% lift in conversion rates](https://brandcdn.exacttarget.com/sites/exacttarget/files/deliverables/etmc-predictiveintelligencebenchmarkreport.pdf) for web products.
### Media & Entertainment: Personalized Content
AI-based recommender engines can analyze an individual's purchase behavior and detect patterns that help provide that user with the content suggestions most likely to match his or her interests. This is what Google and Facebook actively apply when recommending ads, and what Netflix does behind the scenes when recommending movies and TV shows.
### Personalized Banking
A mass-market product that is consumed digitally by millions, banking is prime for recommendations. Knowing a customer's detailed financial situation and their past preferences, coupled with data from thousands of similar users, is quite powerful.


@ -0,0 +1,97 @@
# Segmentation
Image segmentation is the field of image processing that deals with separating an image into multiple subgroups or regions (sets of pixels, also known as image segments) that represent distinct objects or their subparts.
Nowadays, we are constantly making interpretations of the world around us through cameras and other devices. Therefore image segmentation has become an integral part of our lives, since it's an indispensable technique for teaching the devices how to process this interpretation, how to understand the world around them.
In this collection, we will cover:
- What is image segmentation?
- Types of image segmentation
- How does image segmentation work?
- Use-cases and applications
- Where to get started
---
## What is image segmentation?
Image segmentation is a computer vision process by which a digital image is divided into various categories or segments. We use this method to understand what is depicted in an image through a pixel-wise classification. It is quite distinct from image classification, which assigns a single label to an entire image, and from object detection, which identifies and locates objects within an image by drawing bounding boxes around them. Image segmentation provides more fine-grained, pixel-level knowledge of the image content.
Consider a roadside scene with pedestrians, cars and lights:
![](img/3_image-segmentation-figure-1.png)
This photo is made up of an immense number of individual pixels, and image segmentation aims to assign each of those pixels to the object to which it belongs. Segmentation of an image enables us to segregate the foreground from the background, identify a road or a car's precise location, and mark the margins that separate a pedestrian from a car or road.
---
## Types of image segmentation
Image segmentation tasks can be broken down into two broad categories: semantic segmentation and instance segmentation.
1. Semantic segmentation: the process of classifying each pixel as belonging to a particular label. It does not differentiate between instances of the same object. For example, if there are 2 cats in an image, semantic segmentation gives the same label to all the pixels of both cats.
2. Instance segmentation: this differs from semantic segmentation in that it gives a unique label to every instance of a particular object in the image. As can be seen in the image above, all 3 dogs are assigned different colours, i.e. different labels, whereas with semantic segmentation all of them would have been assigned the same colour.
---
## How does image segmentation work?
Let's consider image segmentation as a function.
An image is given as input to the function, and it returns a matrix or mask as output, where each element tells us which class or instance that pixel belongs to.
Machine learning approaches to image segmentation train models to recognize which features of an image are crucial, rather than designing bespoke heuristics by hand.
Although deep neural network architectures for image segmentation may differ in implementation, most follow a similar basic structure:
![](img/3_image-segmentation-figure-2.png)
Source - [SegNet Paper](https://arxiv.org/pdf/1511.00561.pdf)
- The encoder: A set of layers that extract features of an image through a sequence of progressively narrower and deeper filters. Oftentimes, the encoder is pre-trained on a different task (like image recognition), where it learns statistical correlations from many images and may transfer that knowledge for the purposes of segmentation.
- The Decoder: A set of layers that progressively grows the output of the encoder into a segmentation mask resembling the pixel resolution of the input image.
- Skip connections: Long range connections in the neural network that allow the model to draw on features at varying spatial scales to improve model accuracy.
Most of the architectures used for segmentation tasks are built on the Fully Convolutional Network (FCN) approach, i.e. the architecture contains convolutional layers instead of dense layers. Though various models follow the FCN approach, a few handpicked models generally used in production are UNet, Mask R-CNN, and DeepLabv3.
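To make the encoder-decoder-with-skip-connections idea concrete, here is a minimal PyTorch sketch in the spirit of UNet. The channel sizes, the single skip connection, and the three-class output are illustrative simplifications, not the architecture of any model in this Collection.
```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    """Encoder downsamples, decoder upsamples back to input resolution, one skip connection."""
    def __init__(self, num_classes=3):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, num_classes, kernel_size=1)    # per-pixel class scores

    def forward(self, x):
        skip = self.enc1(x)                                # full-resolution features
        deep = self.enc2(self.down(skip))                  # lower-resolution, deeper features
        up = self.up(deep)                                 # back to full resolution
        fused = self.dec(torch.cat([up, skip], dim=1))     # skip connection
        return self.head(fused)                            # (N, num_classes, H, W) logits

logits = TinySegNet()(torch.randn(1, 3, 64, 64))
mask = logits.argmax(dim=1)                                # (N, H, W) predicted class per pixel
```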
---
## Use-cases and applications
Image segmentation can be useful for a lot of different use cases: handwriting recognition, virtual try-on, visual image search, road scene segmentation, organ segmentation, and much more. Here are a few applications explained in detail:
### Autonomous vehicles:
There are a lot of things that need your attention while driving: the road, other vehicles, pedestrians, sidewalks, and (potentially) a plethora of other obstacles and safety hazards.
If you've been driving for a long time, noticing and reacting to this environment might seem automatic, like second nature. A self-driving car, however, has to see, interpret, and respond to a scene in real time. This implies the need to create a pixel-level map of the world through the vehicle's camera system in order to navigate safely and efficiently.
Even though the field of autonomous machines/automobiles is much more complex than segmentation alone, this pixel-level understanding is an essential ingredient in making it a reality.
![](img/3_image-segmentation-figure-3.png)
### Medical imaging and diagnostics:
In the initial steps of a diagnostic and treatment pipeline for many conditions that require medical images, such as CT or MRI scans, image segmentation can be used as a powerful technique.
Essentially, segmentation can effectively distinguish and separate homogeneous areas that may include particularly important pixels of organs, lesions, etc. However, there are significant challenges, including low contrast, noise, and various other imaging ambiguities.
![](img/3_image-segmentation-figure-4.png)
### Virtual try-on:
Virtual try-on of clothes is quite a fascinating feature. It used to be available only in stores, using specialized hardware that creates a 3D model, but with deep learning and image segmentation the same result can be obtained using just a 2D image.
![](img/3_image-segmentation-figure-5.png)
---
## Where to get started
NVIDIA provides Deep Learning Examples for image segmentation in its GitHub repository. These examples provide you with easy-to-consume and highly optimized scripts for both training and inference. The quick start guides in the repository will help you set up the environment using NGC Docker images, download pre-trained models from NGC, and adapt the model training and inference for your application/use case.
Here are the examples relevant for image segmentation, directly from [Deep Learning Examples](https://github.com/NVIDIA/DeepLearningExamples):
1. 3D UNet for Medical Image Segmentation using TensorFlow 1.x
- [Git repository](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Segmentation/UNet_3D_Medical)
- Uses TensorFlow 20.06-tf1-py3 [NGC container](https://ngc.nvidia.com/registry/nvidia-tensorflow)
2. 2D UNet for Industrial Defect Segmentation using TensorFlow 1.x
- [Git repository](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Segmentation/UNet_Industrial)
- Uses TensorFlow 20.06-tf1-py3 [NGC container](https://ngc.nvidia.com/registry/nvidia-tensorflow)
3. MaskRCNN for Common Objects Segmentation using PyTorch
- [Git repository](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Segmentation/MaskRCNN)
- Uses PyTorch 20.06-py3 [NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)

Binary file not shown (60 KiB)
Binary file not shown (87 KiB)
Binary file not shown (374 KiB)
Binary file not shown (43 KiB)
Binary file not shown (414 KiB)


@ -0,0 +1,62 @@
# Speech-to-text
Giving voice commands to an interactive virtual assistant, converting audio to subtitles on an online video, and transcribing customer interactions into text for archiving at a call center are all use cases for Automatic Speech Recognition (ASR) systems. With deep learning, the latest speech-to-text models are capable of recognizing and transcribing audio into text in real time! Good models can perform well in noisy environments, are robust to accents, and have low word error rates (WERs).
![](img/8_speech-to-text-figure-1.png)
In this collection, we will cover:
- How does speech-to-text work?
- Use cases and applications
- Where to get started
---
## How does speech-to-text work?
![](img/8_speech-to-text-figure-2.png)
Source: https://developer.nvidia.com/blog/how-to-build-domain-specific-automatic-speech-recognition-models-on-gpus/
Speech-to-text is a challenging process, as it involves a series of tasks, described below.
### Feature extraction:
First, we resample the raw analog audio signal into a discrete form, followed by traditional signal preprocessing techniques such as standardization, windowing, and conversion to a machine-understandable form via a spectrogram transformation.
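As an illustration, the sketch below computes a simple magnitude spectrogram with NumPy using windowed FFTs. The frame and hop sizes are arbitrary example values; real pipelines typically rely on library routines such as torchaudio.transforms.MelSpectrogram.
```python
import numpy as np

def spectrogram(signal, frame_size=400, hop=160):
    """Magnitude spectrogram: slice the signal into frames, apply a Hann window and an FFT."""
    window = np.hanning(frame_size)
    frames = [signal[start:start + frame_size] * window
              for start in range(0, len(signal) - frame_size + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1)).T   # (freq_bins, time_steps)

audio = np.random.randn(16000)          # one second of fake 16 kHz audio
spec = spectrogram(audio)
print(spec.shape)                       # (201, 98)
```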
### Acoustic Modelling:
Acoustic models come in various types and use different loss functions, but the most widely used in the literature and in production are Connectionist Temporal Classification (CTC) based models, which take a spectrogram (X) as input and produce the log-probability scores (P) of all vocabulary tokens for each time step. Examples include NVIDIA's Jasper and QuartzNet.
![](img/8_speech-to-text-figure-3.png)
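For intuition, here is a minimal greedy CTC decoding sketch in PyTorch: pick the most likely token per time step, collapse repeats, and drop the blank symbol. The toy vocabulary and the blank index are illustrative assumptions; production systems typically use beam search, often combined with a language model.
```python
import torch

VOCAB = ["<blank>", "a", "b", "c"]      # index 0 is the CTC blank (illustrative vocabulary)

def greedy_ctc_decode(log_probs):
    """log_probs: (time_steps, vocab_size) per-frame log-probabilities from an acoustic model."""
    best = log_probs.argmax(dim=1).tolist()          # most likely token per time step
    decoded, prev = [], None
    for idx in best:
        if idx != prev and idx != 0:                 # collapse repeats, skip blanks
            decoded.append(VOCAB[idx])
        prev = idx
    return "".join(decoded)

fake_output = torch.randn(50, len(VOCAB)).log_softmax(dim=1)
print(greedy_ctc_decode(fake_output))
```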
### Language Modeling:
A language model adds contextual knowledge about the language and helps correct the acoustic model's mistakes. It determines the likely context of the speech by combining what the acoustic model understands with a probability distribution over the next possible words in the sequence.
---
## Use cases and applications
### Automatic Transcription in Online Meetings/Conferences:
Maintaining notes during meetings is often crucial, yet hectic. We are prone to small errors and distractions throughout a meeting, which means that the notes we take aren't always accurate and are generally incomplete. By keeping digital transcripts of calls, your team will not only be able to share conversations efficiently, but also better understand the customer requirements, agenda, and technical aspects behind the meeting.
### Captioning & Subtitling on Digital Platforms:
Captions and subtitles provide communication access to students and professionals during media sessions and live lectures, with easy-to-read transcripts containing precise grammar, proper punctuation, and accurate spelling. This also improves the reach and accessibility of education for deaf or hard-of-hearing audiences.
### Documentation at medical facilities:
Medical doctors and clinicians can use this technique to efficiently digitize physician-patient conversations into text for entry into health record systems, with models trained to understand medical terminology. This enables practitioners to focus more on patient care than on documentation while listening to patients.
---
## Where to get started
NVIDIA provides Deep Learning Examples for automatic speech recognition in its GitHub repository. These examples provide you with easy-to-consume and highly optimized scripts for both training and inference. The quick start guides in the repository will help you set up the environment using NGC Docker images, download pre-trained models from NGC, and adapt the model training and inference for your application/use case.
Here are the examples relevant for speech recognition, directly from [Deep Learning Examples](https://github.com/NVIDIA/DeepLearningExamples):
1. Jasper on Librispeech for English ASR using PyTorch
- [Git repository](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechRecognition/Jasper)
- Uses PyTorch 20.06-py3 [NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
2. Kaldi ASR integrated with Triton Inference Server
- [Git repository](https://github.com/NVIDIA/DeepLearningExamples/tree/master/Kaldi/SpeechRecognition)
- Uses Triton 19.12-py3 [NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:tritonserver)

Binary file not shown (12 KiB)
Binary file not shown (47 KiB)
Binary file not shown (71 KiB)


@ -0,0 +1,54 @@
# Text-to-speech
Speech synthesis, or text-to-speech (TTS), is the task of artificially producing human speech from raw transcripts. With deep learning today, the synthesized waveforms can sound very natural, almost indistinguishable from how a human would speak. Such text-to-speech models can be used when an interactive virtual assistant responds, or when a mobile device converts the text on a webpage to speech for accessibility reasons.
In this collection, we will cover:
- How does text-to-speech work?
- Use cases and applications
- Where to get started
---
## How does text-to-speech work?
![](img/9_text-to-speech-figure-1.png)
TTS synthesis is a two-step process:
1. Text-to-Spectrogram Model:
This model transforms the text into time-aligned features such as a spectrogram, a mel spectrogram, or F0 frequencies and other linguistic features. We use architectures like Tacotron 2 for this step.
2. Spectrogram-to-Audio Model:
This model converts the generated time-aligned spectrogram representation into continuous, human-like audio, for example with WaveGlow.
![](img/9_text-to-speech-figure-2.png)
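A minimal sketch of this two-stage pipeline is shown below. The two stand-in modules only emit random tensors with plausible shapes; in practice, the Tacotron 2 and WaveGlow models from this Collection would take their place.
```python
import torch
import torch.nn as nn

class FakeText2Mel(nn.Module):
    """Stand-in for a Tacotron 2-style model: token IDs -> mel spectrogram frames."""
    def forward(self, token_ids):
        return torch.randn(1, 80, token_ids.shape[1] * 5)   # (batch, mel_bins, frames)

class FakeVocoder(nn.Module):
    """Stand-in for a WaveGlow-style vocoder: mel spectrogram -> waveform samples."""
    def forward(self, mel):
        return torch.randn(1, mel.shape[2] * 256)           # (batch, audio_samples)

def synthesize(text_to_mel, vocoder, token_ids):
    with torch.no_grad():
        mel = text_to_mel(token_ids)     # step 1: text -> time-aligned spectrogram
        return vocoder(mel)              # step 2: spectrogram -> audio

audio = synthesize(FakeText2Mel(), FakeVocoder(), torch.randint(0, 100, (1, 20)))
print(audio.shape)                       # torch.Size([1, 25600])
```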
---
## Use Cases and applications
### Telecommunications and Multimedia:
E-mail services have become very prevalent in this decade. However, it is sometimes challenging to read important messages while abroad, when a proper computer may not be available or security concerns may arise. With TTS technology, e-mail messages can be listened to quickly and efficiently on smartphones, adding to productivity.
### Voice assistant for Visually Impaired, Vocally Handicapped:
- Possibly the most useful and vital application of TTS is reading printed or non-braille texts to visually impaired or blind users.
- TTS also helps vocally handicapped people, who find it difficult to communicate with others who do not understand sign language.
### Voice Assistant:
- Modern home appliances such as refrigerators can adopt this use case to read cooking recipes aloud.
- Automobiles can use it for voice navigation to the destination.
- It makes it easier to teach the pronunciation and phonetics of difficult multi-lingual texts.
---
## Where to get started
NVIDIA provides Deep Learning Examples for speech synthesis in its GitHub repository. These examples provide you with easy-to-consume and highly optimized scripts for both training and inference. The quick start guides in the repository will help you set up the environment using NGC Docker images, download pre-trained models from NGC, and adapt the model training and inference for your application/use case.
Here are the examples relevant for speech synthesis, directly from [Deep Learning Examples](https://github.com/NVIDIA/DeepLearningExamples):
1. Tacotron2 and WaveGlow for Speech Synthesis using PyTorch
- [Git repository](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2)
- Uses PyTorch 20.03-py3 [NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
2. FastPitch for text to mel-spectrogram generation using PyTorch
- [Git repository](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch)
- Uses PyTorch 20.03-py3 [NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)

Binary file not shown (80 KiB)
Binary file not shown (163 KiB)


@ -0,0 +1,58 @@
# Machine Translation
Machine translation is the task of translating text from one language to another. Simply replacing one word with its equivalent in another language rarely produces a semantically meaningful translation, because that may not account for phrase-level meaning at all. A good machine translation system may require modeling whole sentences or phrases. The use of neural networks has allowed end-to-end architectures that can accomplish this, mapping from the input text to the corresponding output text. A good model should be able to handle challenges like morphologically rich languages and very large vocabularies well, while maintaining reasonable training and inference times. This Collection contains state-of-the-art models and containers that can help with the task of machine translation.
In this collection, we will cover:
- Challenges in Machine Translation
- Model architecture
- Where to get started
---
## Challenges in Machine Translation
Not long ago, translating text from an unfamiliar language was very time-consuming. Using simple vocabularies for word-for-word translation was challenging for two reasons: 1) the user had to know the grammar rules, and 2) the user had to keep all language variations in mind while translating a whole sentence.
Presently, we don't need to struggle so much: we can translate phrases, sentences, and even large texts just by putting them into Google Translate.
If Google Translate simply tried to store translations for even short sentences, it wouldn't work, because of the massive number of possible variations. Another seemingly useful approach would be to teach the machine sets of grammar rules and have it translate accordingly. If only it were as easy as it sounds.
If you have ever tried learning a foreign language, you know that there are always many exceptions to the rules. When we try to capture all of these rules, limitations, and exceptions in a program, the quality of translation breaks down.
---
## Model architecture
i) Google's Neural Machine Translation:
Sequence-to-Sequence (seq2seq) models are used for several Natural Language Processing (NLP) tasks, such as text summarization, speech recognition, and nucleotide sequence modeling. Here, we aim to translate given sentences from one language to another.
Both the input and the output are sentences; in other words, these sentences are sequences of words going in and out of the network. This is the fundamental idea of Sequence-to-Sequence modeling. The figure below illustrates this technique.
![Basic Architecture](img/6_machine-translation-figure-1.png)
Source - https://developer.nvidia.com/blog/introduction-neural-machine-translation-with-gpus/
The GNMT v2 model is similar to the one discussed in the [Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation](https://arxiv.org/abs/1609.08144) paper.
The most important difference between the two models is in the attention mechanism. In the version 2 (v2) model, the output of the decoder's first LSTM layer goes into the attention module; the re-weighted context is then concatenated with the inputs to all subsequent LSTM layers in the decoder at the current time step.
![Basic Architecture](img/6_machine-translation-figure-2.png)
ii) Transformer-based Neural Machine Translation:
The Transformer model uses a typical NMT encoder-decoder architecture. Unlike other NMT models, this approach uses no recurrent connections and operates on a fixed-size context window. The encoder stack is made up of N identical layers, each composed of the following sublayers: 1. a self-attention layer, and 2. a feedforward network (two fully-connected layers). Like the encoder stack, the decoder stack is made up of N identical layers, each composed of the following sublayers: 1. a self-attention layer, 2. a multi-headed attention layer combining encoder outputs with results from the previous self-attention layer, and 3. a feedforward network (two fully-connected layers).
The encoder uses self-attention to compute a representation of the input sequence. The decoder generates the output sequence one token at a time, taking the encoder output and previously generated decoder output tokens as inputs. The model also applies embeddings to the input and output tokens, and adds a fixed positional encoding. The positional encoding adds information about the position of each token.
![Basic Architecture](img/6_machine-translation-figure-3.png)
Source - [Attention is all you Need](https://arxiv.org/abs/1706.03762)
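As a small illustration of the fixed positional encoding mentioned above, the sketch below computes the sinusoidal encoding from the "Attention Is All You Need" paper; the sequence length and model dimension are arbitrary example values.
```python
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    position = torch.arange(seq_len).unsqueeze(1).float()                   # (seq_len, 1)
    div_term = torch.pow(10000.0, torch.arange(0, d_model, 2).float() / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)
    pe[:, 1::2] = torch.cos(position / div_term)
    return pe

# Added to the token embeddings so the model knows where each token sits in the sequence.
pe = sinusoidal_positional_encoding(seq_len=16, d_model=512)
print(pe.shape)   # torch.Size([16, 512])
```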
---
## Where to get started
NVIDIA provides Deep Learning Examples for machine translation in its GitHub repository. These examples provide you with easy-to-consume and highly optimized scripts for both training and inference. The quick start guides in the repository will help you set up the environment using NGC Docker images, download pre-trained models from NGC, and adapt the model training and inference for your application/use case.
Here are the examples relevant for machine translation, directly from [Deep Learning Examples](https://github.com/NVIDIA/DeepLearningExamples):
1. Machine translation with GNMT using TensorFlow
- [Git repository](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Translation/GNMT)
- Uses TensorFlow 20.06-tf1-py3 [NGC container](https://ngc.nvidia.com/registry/nvidia-tensorflow)
2. Machine translation with Transformers using PyTorch
- [Git repository](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Translation/Transformer)
- Uses PyTorch 20.03-py3 [NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)

Binary file not shown (17 KiB)
Binary file not shown (164 KiB)
Binary file not shown (172 KiB)


@ -0,0 +1,56 @@
# Image Classification
Image classification is the task of categorizing an image into one of several predefined classes, often also giving a probability that the input belongs to a particular class. This task is crucial in understanding and analyzing images, and it comes quite effortlessly to human beings with our complex visual systems. Most powerful image classification models today are built using some form of Convolutional Neural Network (CNN), which is also the backbone of many other tasks in Computer Vision.
![What is Image Classification?](img/1_image-classification-figure-1.PNG)
[Source](https://github.com/NVlabs/stylegan)
In this overview, we will cover
- Types of image classification
- How does it work?
- How is the performance evaluated?
- Use cases and applications
- Where to get started
---
## Types of image classification
Image classification can be broadly divided into binary or multi-class problems depending on the number of categories. Binary image classification problems entail predicting one of two classes. An example of this would be predicting whether an image is that of a dog or not. A subtly different problem is that of single-class (one-vs-all) classification, where the goal is to recognize data from one class and reject all others. This is beneficial when there is an overabundance of data from one of the classes, also called a class imbalance.
![Input and Outputs for Image Classification](img/1_image-classification-figure-2.PNG)
In Multi-class classification problems, models categorize instances into one of three or more categories. Multi-class models often also return confidence scores (or probabilities) of an image belonging to each of the possible classes. This should not be confused with multi-label classification, where a model assigns multiple labels to an instance.
---
## How does it work?
In recent years, Convolutional Neural Networks (CNNs) have led the way to massive breakthroughs in Computer Vision. Most state-of-the-art Image Classification models today employ CNNs in some form. Convolutional Layers are the building blocks of CNNs, and similar to Neural Networks they are composed of neurons that learn parameters like weights and biases. Most CNNs are composed of many Convolutional layers that work like feature extractors, and coupled with Fully Connected (FC) layers they learn to identify patterns in images to return confidence scores in different categories.
But what makes Convolutional Networks special? Well, CNNs are built with the assumption that input is in the form of images, and exploiting this fact they can be vastly more efficient than a standard Neural Network for a given level of performance.
![Typical CNN architecture](img/1_image-classification-figure-3.PNG)
Network depth (number of layers) and the number of learnable parameters have been found to be of crucial importance in performance. Top models can typically have over a hundred layers and hundreds of millions of parameters. Much of recent research in visual recognition has been focused around “network engineering”, i.e. designing better architectures, even employing Machine Learning algorithms to search for one, such as in the case of Neural Architecture Search.
---
## How is the performance evaluated?
Image Classification performance is often reported as Top-1 or Top-5 scores. In top-1 score, classification is considered correct if the top predicted class (with the highest predicted probability) matches the true class for a given instance. In top-5, we check if one of the top 5 predictions matches the true class. The score is just the number of correct predictions divided by the total number of instances evaluated.
---
## Use cases and applications
### Categorizing Images in Large Visual Databases
Businesses with visual databases may accumulate large amounts of images with missing tags or metadata. Unless there is an effective way to organize such images, they may not be of much use at all; worse, they may hog precious storage space. Automated image classification algorithms can classify such untagged images into predefined categories, allowing businesses to avoid expensive manual labor.
A related task is that of Image Organization in smart devices like mobile phones. With Image Classification techniques, images and videos can be organized for improved accessibility.
### Visual Search
Visual Search or Image-based search has risen to popularity over the recent years. Many prominent search engines already provide this feature where users can search for visual content similar to a provided image. This has many applications in the e-commerce and retail industry where users can take a snap and upload an image of a product they are interested in purchasing. This makes the shopping experience much more efficient for customers, and can increase sales for businesses.
### Healthcare
Medical imaging is about creating visual images of internal body parts for clinical purposes. This includes health monitoring, medical diagnosis, treatment, and keeping organized records. Image classification algorithms can play a crucial role in medical imaging by helping medical professionals detect the presence of illness and by improving consistency in clinical diagnosis.
---
## Where to get started?
In this Collection, you will find state-of-the-art implementations of Image Classification models and their containers. A good place to get started with Image Classification is with the [ResNet-50](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Classification/ConvNets/resnet50v1.5) model.
ResNets (Residual Networks) are very popular Convolutional Neural Network architectures built with blocks that use skip connections to jump over some layers. As the name suggests, ResNet-50 is a variant that is 50 layers deep! But why do we need these “skip” connections? As it turns out, building better CNN architectures is not as simple as stacking more and more layers. In practice, if we just keep adding depth to a CNN, at some point the performance stagnates or may even start getting worse. Very deep networks are notoriously difficult to train because of the vanishing gradient problem: as the depth increases, repeated multiplications during back-propagation can make the gradient vanishingly small, which may prevent the weights from changing. In ResNets, the skip connections act like a “gradient superhighway”, allowing the gradient to flow unrestrained and alleviating the vanishing gradient problem. ResNets were very influential in the development of subsequent Convolutional Network architectures, and there is much more to them than the brief summary above!

Binary file not shown (294 KiB)
Binary file not shown (112 KiB)
Binary file not shown (105 KiB)


@ -0,0 +1,51 @@
# Object Detection
A natural progression from image classification is classification and localization of the subject of the image. We can take this idea one step further and localize all the objects in a given image. Simply put, object detection refers to identifying which objects are present in an image and where they are located.
![](img/2_object-detection-figure-1.png)
Source: [Joseph Redmon, Ali Farhadi, “YOLO9000:Better, Faster, Stronger”](https://arxiv.org/abs/1612.08242)
## Introduction to Object Detection
In this section we will try to answer the following questions:
- What is object detection?
- Why is object detection important?
Object Detection is about not only detecting the presence and location of objects in images and videos, but also categorizing them into everyday objects. Oftentimes, there is a confusion between Image Classification and Object Detection. Simply put, the difference between them is the same as the difference between saying “This is a cat” and pointing to a cat and saying “There is the cat”.
To build autonomous systems, perception is the main challenge to be solved. Perception, in the context of autonomous systems, refers to the ability to understand the surroundings of the autonomous agent. This means that the agent needs to be able to figure out where and what objects are in its immediate vicinity.
Object detection can help keep humans away from toxic environments and hazardous situations. Challenges like garbage segregation, oil rig monitoring, nightly surveillance, cargo port maintenance and other high-risk applications can be aided by robots or cameras that can detect objects. Essentially, in any environment that requires visual inspection or analysis and is too dangerous for humans, object detection pipelines can be used to shield people from onsite hazards.
## How does it work?
While this has been a topic of research since before Deep Learning became mainstream, the best performing models today use one or more Deep Neural Networks.
Many architectures have networks pretrained on a different, simpler task, like Image Classification. As one can imagine, the inputs to this task can be images or videos, and the outputs are usually a set of bounding box coordinates that enclose each of the detected objects, as well as a class label for each detected object. With advances in research and the use of GPUs, it is possible to have object detection in real time with really impressive accuracies!
![](img/2_object-detection-figure-2.png)
Source: [Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg, “SSD: Single Shot MultiBox Detector”](https://arxiv.org/abs/1512.02325)
Single Shot Detector (SSD) is one of the state-of-the-art models for object detection and localization. It is based on a feed-forward convolutional neural network which always yields a fixed-size set of bounding boxes, along with a confidence score for each box representing how confident the network is that the box contains an object. This is followed by a non-maximum suppression step which outputs the final detections.
This network can be understood as two networks stacked on top of each other. The first network is a simple convolutional neural network which “extracts important features”, just like an image classification network.
The second network is a multiscale feature map network built using another set of convolutional layers which are progressively smaller in size to allow detections at multiple scales. Simply put, the progressively smaller layers help detect objects of different sizes. Each layer in this set outputs a number of detections, and the final layer passes its output to a non-maximum suppression step which yields the final set of detections.
This collection contains models and containers for object detection that achieve state-of-the-art accuracy, tested and maintained by NVIDIA.
## Applications and Use cases
### Autonomous Vehicles
Autonomous vehicles need to perceive and interact with real-world objects in order to blend into their environment. For instance, a self-driving car needs to detect other vehicles, pedestrians, traffic signals, and all other obstacles on the road, and it also needs to know the exact location of these objects. This perception information helps the agent avoid obstacles and interact correctly with objects such as traffic lights.
### Warehouses
Warehouses run many conveyor belts and sorting platforms, tasks that have traditionally been handled manually. As factories and warehouses grow, manual sorting and inventory management cannot scale proportionally. Object detection pipelines deployed on robots can reduce operational friction and let businesses scale up far more easily.
### Surveillance
Surveillance systems typically accumulate large volumes of video data that need to be analyzed for all sorts of anomalies. Given how many video sources even a small store has, analyzing surveillance data from a large operation is a challenge. Object detection networks can automate much of this pipeline by highlighting segments that contain an object of interest, and they can also be trained to identify anomalies in video streams.
### Hazardous tasks
Waste processing plants, nuclear power plants, oil rigs and heavy machinery are extremely hazardous environments that pose health risks, yet they require human presence mainly for visual tasks such as recognizing objects and relaying their locations. Risky tasks like these can be handled by an object detection pipeline deployed on a camera or robot, reducing operational risk and cost.


View file

@ -0,0 +1,93 @@
# Language Modeling
Language modeling (LM) is a natural language processing (NLP) task that determines the probability of a given sequence of words occurring in a sentence.
In an era where computers, smartphones and other electronic devices increasingly need to interact with humans, language modeling has become an indispensable technique for teaching devices how to communicate in natural languages in human-like ways.
But how does language modeling work? And what can you build with it? What are the different approaches, what are its potential benefits and limitations, and how might you use it in your business?
In this guide, you'll find answers to all of those questions and more. Whether you're an experienced machine learning engineer considering implementation, a developer wanting to learn more, or a product manager looking to explore what's possible with natural language processing and language modeling, this guide is for you.
Here's a look at what we'll cover:
- Language modeling: the basics
- How does language modeling work?
- Use cases and applications
- Getting started
## Language modeling: the basics
### What is language modeling?
"*Language modeling is the task of assigning a probability to sentences in a language. […]
Besides assigning a probability to each sequence of words, the language models also assign a
probability for the likelihood of a given word (or a sequence of words) to follow a sequence
of words.*" Source: Page 105, [Neural Network Methods in Natural Language Processing](http://amzn.to/2wt1nzv), 2017.
### Types of language models
There are primarily two types of Language Models:
- Statistical Language Models: These models use traditional statistical techniques like N-grams, Hidden Markov Models (HMM), and certain linguistic rules to learn the probability distribution of words.
- Neural Language Models: They use different kinds of Neural Networks to model language, and have surpassed the statistical language models in their effectiveness.
"*We provide ample empirical evidence to suggest that connectionist language models are
superior to standard n-gram techniques, except their high computational (training)
complexity.*" Source: [Recurrent neural network based language model](http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf), 2010.
Given the superior performance of neural language models, we include in the container two popular state-of-the-art neural language models: BERT and Transformer-XL.
### Why is language modeling important?
Language modeling is fundamental in modern NLP applications. It enables machines to understand qualitative information, and enables people to communicate with machines in the natural languages that humans use to communicate with each other.
Language modeling is used directly in a variety of industries, including tech, finance, healthcare, transportation, legal, military, government, and more. In fact, you have probably already interacted with a language model today, whether through Google search, a voice assistant, or a text autocomplete feature.
## How does language modeling work?
The roots of modern language modeling can be traced back to 1948, when Claude Shannon
published a paper titled "A Mathematical Theory of Communication", laying the foundation for information theory and language modeling. In the paper, Shannon detailed the use of a stochastic model called the Markov chain to create a statistical model for the sequences of letters in English text. Markov models, along with n-grams, are still among the most popular statistical language models today.
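As a toy illustration of this statistical approach, the sketch below builds a character-level bigram (first-order Markov chain) model in plain Python. The tiny corpus, smoothing constant, and vocabulary size are made up for illustration only.

```python
# Character-level bigram language model: a toy first-order Markov chain.
from collections import Counter, defaultdict
import math

corpus = "the cat sat on the mat and the cat ate"

# Count how often each character follows each other character.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def sequence_log_prob(text, alpha=0.1, vocab_size=27):
    """Log-probability of `text` under the bigram model with add-alpha smoothing."""
    logp = 0.0
    for prev, nxt in zip(text, text[1:]):
        total = sum(counts[prev].values())
        p = (counts[prev][nxt] + alpha) / (total + alpha * vocab_size)
        logp += math.log(p)
    return logp

# Sequences that resemble the training text score higher (less negative).
print(sequence_log_prob("the cat"), sequence_log_prob("xqz kjv"))
```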
However, simple statistical language models have serious drawbacks in scalability and fluency because of their sparse representation of language. Neural language models overcome this problem by representing language units (e.g. words, characters) as non-linear, distributed combinations of weights in a continuous space, which lets them generalize to rare or unseen words instead of being misled by them.
Therefore, as mentioned above, we introduce two popular state-of-the-art neural language models, BERT and Transformer-XL, in TensorFlow and PyTorch. More details can be found in the [NVIDIA Deep Learning Examples GitHub Repository](https://github.com/NVIDIA/DeepLearningExamples).
## Use cases and applications
### Speech Recognition
Imagine speaking a phrase to the phone, expecting it to convert the speech to text. How does
it know if you said "recognize speech" or "wreck a nice beach"? Language models help figure it out
based on the context, enabling machines to process and make sense of speech audio.
### Spelling Correction
Spellcheckers powered by language models can flag spelling errors and suggest alternatives.
### Machine translation
Imagine you are translating the Chinese sentence "我在开车" into English. Your translation system gives you several choices:
- I at open car
- me at open car
- I at drive
- me at drive
- I am driving
- me am driving
A language model tells you which translation sounds the most natural.
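As a sketch of how a language model ranks these candidates by fluency, the snippet below scores each one with GPT-2 via Hugging Face Transformers. GPT-2 is used here purely as an illustrative stand-in; it is not part of this collection, and the candidate list is the toy example above.

```python
# Rank candidate translations by average negative log-likelihood under a causal LM.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def avg_neg_log_likelihood(sentence: str) -> float:
    """Lower is better: the model finds the sentence more natural."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

candidates = ["I at open car", "me at drive", "I am driving"]
# "I am driving" should come first, i.e. receive the lowest average NLL.
print(sorted(candidates, key=avg_neg_log_likelihood))
```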
## Getting started
NVIDIA provides examples for Language Modeling in the [Deep Learning Examples GitHub Repository](https://github.com/NVIDIA/DeepLearningExamples). These examples provide you with easy-to-consume, highly optimized scripts for both training and inference. The quick start guide in the repository will help you set up the environment using NGC Docker images, download pre-trained models from NGC, and adapt the model training and inference to your application or use case.
These models are tested and maintained by NVIDIA, leveraging mixed precision using tensor cores on our latest GPUs for faster training times while maintaining accuracy.

View file

@ -0,0 +1,65 @@
# Recommender Systems
Recommender systems are a type of information filtering system that seeks to predict the
"rating" or "preference" a user would give to an item. (Source:
[Wikipedia](https://en.wikipedia.org/wiki/Recommender_system))
In an era where users have to navigate through an exponentially growing number of goods and services, recommender systems have become key in driving user engagement, teaching the internet services how to personalize experiences for users. They are ubiquitous and indispensable in commercial online platforms.
In this guide, you'll find answers to how recommender systems work, how you might use them in your business, and more. Whether you're an experienced machine learning engineer considering implementation, a developer wanting to learn more, or a product manager looking to explore what's possible with recommender systems, this guide is for you.
Here is a look at what we will cover:
- Challenges and opportunities in recommender systems
- How do DL-based recommender systems work?
- Use cases and applications
## Challenges and opportunities in recommender systems
With the rapid growth in scale of industry datasets, deep learning (DL) recommender models have started to gain advantages over traditional methods by capitalizing on large amounts of training data. However, there are multiple challenges when it comes to performance of large-scale recommender systems solutions:
- Huge datasets: Commercial recommenders are trained on huge datasets, often several terabytes in scale.
- Complex data preprocessing and feature engineering pipelines: Datasets need to be preprocessed and transformed into a form relevant to be used with DL models and frameworks. In addition, feature engineering creates an extensive set of new features from existing ones, requiring multiple iterations to arrive at an optimal solution.
- Input bottleneck: Data loading, if not well optimized, can be the slowest part of the training process, leading to under-utilization of high-throughput computing devices such as GPUs.
- Extensive repeated experimentation: The whole data engineering, training, and evaluation process is generally repeated many times, requiring significant time and computational resources.
To meet the computational demands for large-scale DL recommender systems training and inference, recommender-on-GPU solutions aim to provide fast feature engineering and high training throughput (to enable both fast experimentation and production retraining), as well as low latency, high-throughput inference.
Current DL-based models for recommender systems include the [Wide and
Deep](https://arxiv.org/abs/1606.07792) model, Deep Learning Recommendation Model
([DLRM](https://github.com/facebookresearch/dlrm)), neural collaborative filtering
([NCF](https://arxiv.org/abs/1708.05031)), Variational Autoencoder
([VAE](https://arxiv.org/abs/1802.05814)) for Collaborative Filtering, and
[BERT4Rec](https://arxiv.org/pdf/1904.06690.pdf), among others.
## How do DL-based recommender systems work?
In [NVIDIA Deep Learning Examples](https://github.com/NVIDIA/DeepLearningExamples), we introduce several popular state-of-the-art DL-based recommender models in TensorFlow and PyTorch.
As an example, let us start by discussing our reference implementation of DLRM. With DLRM, we systematically tackle the challenges mentioned above by designing a complete DLRM pipeline, from data preparation to training to production inference. We provide ready-to-go Docker images for training and inference, data downloading and preprocessing tools, and Jupyter demo notebooks to get you started quickly. Also, trained models can be prepared for production inference in one simple step with our exporter tool.
For more details on the model architectures, example code, and how to set up the end-to-end data processing, training, and inference pipeline on GPU, please refer to the [DLRM developer blog](https://developer.nvidia.com/blog/optimizing-dlrm-on-nvidia-gpus/) and the [NVIDIA GPU-accelerated DL model portfolio](https://github.com/NVIDIA/DeepLearningExamples) under /PyTorch/Recommendation/DLRM.
In addition, DLRM forms part of NVIDIA [Merlin](https://developer.nvidia.com/nvidia-merlin), a framework for building high-performance, DL-based recommender systems.
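For intuition only, here is a heavily simplified PyTorch sketch of the DLRM idea: embedding tables for categorical features, a bottom MLP for dense features, pairwise dot-product feature interactions, and a top MLP producing a click probability. The class name, feature cardinalities, and layer sizes are made up; refer to the repository for the actual reference implementation.

```python
# TinyDLRM: an illustrative, not production, sketch of the DLRM architecture.
import torch
import torch.nn as nn

class TinyDLRM(nn.Module):
    def __init__(self, categorical_cardinalities, num_dense, emb_dim=16):
        super().__init__()
        # One embedding table per categorical feature.
        self.embeddings = nn.ModuleList(
            nn.Embedding(card, emb_dim) for card in categorical_cardinalities
        )
        # Bottom MLP maps dense features into the same embedding space.
        self.bottom_mlp = nn.Sequential(
            nn.Linear(num_dense, 32), nn.ReLU(), nn.Linear(32, emb_dim), nn.ReLU()
        )
        num_features = len(categorical_cardinalities) + 1
        num_interactions = num_features * (num_features - 1) // 2
        self.top_mlp = nn.Sequential(
            nn.Linear(emb_dim + num_interactions, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, dense, categorical):
        dense_vec = self.bottom_mlp(dense)                        # (B, emb_dim)
        vecs = [emb(categorical[:, i]) for i, emb in enumerate(self.embeddings)]
        feats = torch.stack([dense_vec] + vecs, dim=1)            # (B, F, emb_dim)
        # Pairwise dot products between all feature vectors (the interaction step).
        dots = torch.bmm(feats, feats.transpose(1, 2))            # (B, F, F)
        i, j = torch.triu_indices(feats.size(1), feats.size(1), offset=1)
        interactions = dots[:, i, j]                              # (B, F*(F-1)/2)
        out = self.top_mlp(torch.cat([dense_vec, interactions], dim=1))
        return torch.sigmoid(out).squeeze(1)                      # click probability

model = TinyDLRM(categorical_cardinalities=[1000, 500, 100], num_dense=13)
dense = torch.randn(4, 13)
categorical = torch.stack([torch.randint(0, c, (4,)) for c in [1000, 500, 100]], dim=1)
print(model(dense, categorical))  # four probabilities in (0, 1)
```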
## Use cases and applications
### E-Commerce & Retail: Personalized Merchandising
Imagine a user has already purchased a scarf. Why not offer a hat that matches this scarf, so that the look is complete? This feature is often implemented by means of AI-based algorithms as “Complete the look” or “You might also like” sections in e-commerce platforms like Amazon, Walmart, Target, and many others.
On average, an intelligent recommender system delivers a [22.66% lift in conversion rates](https://brandcdn.exacttarget.com/sites/exacttarget/files/deliverables/etmc-predictiveintelligencebenchmarkreport.pdf) for web products.
### Media & Entertainment: Personalized Content
AI-based recommender engines can analyze an individual's purchase behavior and detect patterns that help provide that user with the content suggestions most likely to match his or her interests. This is what Google and Facebook actively apply when recommending ads, and what Netflix does behind the scenes when recommending movies and TV shows.
### Personalized Banking
Banking is a mass-market product consumed digitally by millions, which makes it prime territory for recommendations. Knowing a customer's detailed financial situation and past preferences, coupled with data from thousands of similar users, is quite powerful.

View file

@ -0,0 +1,97 @@
# Segmentation
Image segmentation is the field of image processing that deals with separating an image into multiple subgroups or regions (sets of pixels, also known as image segments) that represent distinct objects or their subparts.
Nowadays, we constantly interpret the world around us through cameras and other devices, and image segmentation has become an integral part of our lives: it is an indispensable technique for teaching devices how to process this visual input and understand the world around them.
In this collection, we will cover:
- What is image segmentation?
- Types of image segmentation
- How does image segmentation work?
- Use-cases and applications
- Where to get started
---
## What is image segmentation?
Image segmentation is a computer vision process by which a digital image is divided into various categories or segments. We use this method to understand what is depicted in an image through a pixel-wise classification. It is distinct from image classification, which assigns a label to the entire image, and from object detection, which identifies and locates objects by drawing bounding boxes around them; image segmentation provides finer, pixel-level knowledge of the image content.
Consider a roadside scene with pedestrians, cars and traffic lights:
![](img/3_image-segmentation-figure-1.png)
This photo is made up of an immense number of individual pixels, and image segmentation aims to assign each of those pixels to the object to which it belongs. Segmentation of an image enables us to segregate the foreground from the background, identify a road or a car's precise location, and mark the margins that separate a pedestrian from a car or road.
---
## Types of image segmentation
Image segmentation tasks can be broken down into two broad categories: semantic segmentation and instance segmentation.
1. Semantic segmentation: the process of assigning each pixel to a class label. It does not differentiate between instances of the same object; for example, if there are two cats in an image, semantic segmentation gives the same label to the pixels of both cats.
2. Instance segmentation: this differs from semantic segmentation in that it gives a unique label to every instance of a particular object class. For example, if an image contains three dogs, instance segmentation assigns each dog a different label (visualized as different colours), whereas semantic segmentation would assign all of them the same label.
---
## How does image segmentation work?
Let's consider image segmentation as a function.
An image is given as input to the function and it gives a matrix or a mask as the output, where each element tells us which class or instance that pixel belongs to.
Machine learning approaches to image segmentation train models to recognize which features of an image are crucial, rather than relying on bespoke heuristics designed by hand.
Although deep neural network architectures for image segmentation differ in their implementation details, most follow a similar basic structure:
![](img/3_image-segmentation-figure-2.png)
Source - [SegNet Paper](https://arxiv.org/pdf/1511.00561.pdf)
- The encoder: A set of layers that extract features of an image through a sequence of progressively narrower and deeper filters. Oftentimes, the encoder is pre-trained on a different task (like image recognition), where it learns statistical correlations from many images and may transfer that knowledge for the purposes of segmentation.
- The decoder: A set of layers that progressively grows the output of the encoder into a segmentation mask that matches the pixel resolution of the input image.
- Skip connections: Long range connections in the neural network that allow the model to draw on features at varying spatial scales to improve model accuracy.
Most architectures used for segmentation tasks are built as Fully Convolutional Networks (FCNs), i.e. the architecture is made up of convolutional layers and contains no dense (fully connected) layers. Although many models follow the FCN approach, a few handpicked models generally used in production are UNet, MaskRCNN, and DeepLabv3.
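To make the encoder, decoder, and skip-connection structure concrete, here is a minimal PyTorch sketch in the spirit of UNet. The class name, channel sizes, and depth are made up and far smaller than any production model; this only illustrates the structure described above.

```python
# TinySegNet: a minimal encoder-decoder segmentation sketch with one skip connection.
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    def __init__(self, in_channels=3, num_classes=5):
        super().__init__()
        # Encoder: extract features while shrinking spatial resolution.
        self.enc1 = nn.Sequential(nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        # Decoder: grow the features back to the input resolution.
        self.up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        # One output channel per class; argmax over channels gives the mask.
        self.head = nn.Conv2d(16, num_classes, kernel_size=1)

    def forward(self, x):
        skip = self.enc1(x)                      # (B, 16, H, W)
        feats = self.enc2(self.down(skip))       # (B, 32, H/2, W/2)
        up = self.up(feats)                      # (B, 16, H, W)
        merged = torch.cat([up, skip], dim=1)    # skip connection
        return self.head(self.dec(merged))       # (B, num_classes, H, W)

model = TinySegNet()
logits = model(torch.randn(1, 3, 64, 64))
mask = logits.argmax(dim=1)                      # per-pixel class prediction
print(mask.shape)                                # torch.Size([1, 64, 64])
```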
---
## Use-cases and applications
Image segmentation is useful for many different use cases: handwriting recognition, virtual try-on, visual image search, road scene segmentation, organ segmentation and much more. Here are a few applications explained in more detail:
### Autonomous vehicles:
There are a lot of things that need your attention while driving: the road, other vehicles, pedestrians, sidewalks, and (potentially) a plethora of other obstacles and safety hazards.
If you have been driving for a long time, noticing and reacting to this environment may seem automatic, like second nature. A self-driving car, however, needs to see, interpret, and respond to a scene in real time. This implies building a pixel-level map of the world through the vehicle's camera system in order to navigate safely and efficiently.
Even though the field of autonomous machines is much more complex than performing segmentation alone, this pixel-level understanding is an essential ingredient in making it a reality.
![](img/3_image-segmentation-figure-3.png)
### Medical imaging and diagnostics:
Image segmentation is a powerful technique in the initial steps of the diagnostic and treatment pipeline for many conditions that require medical imaging, such as CT or MRI scans.
Essentially, segmentation can effectively distinguish and separate homogeneous areas that may include particularly important pixels of organs, lesions, etc. However, there are significant challenges, including low contrast, noise, and various other imaging ambiguities.
![](img/3_image-segmentation-figure-4.png)
### Virtual try-on:
Virtually trying on clothes is a fascinating feature that used to be available only in stores, using specialized hardware to create a 3D model. Interestingly, with deep learning and image segmentation the same result can be obtained using just a 2D image.
![](img/3_image-segmentation-figure-5.png)
---
## Where to get started
NVIDIA provides Deep Learning Examples for Image Segmentation on its GitHub repository. These examples provide you with easy-to-consume, highly optimized scripts for both training and inference. The quick start guide at our GitHub repository will help you set up the environment using NGC Docker images, download pre-trained models from NGC, and adapt the model training and inference to your application or use case.
Here are the examples relevant for image segmentation, directly from [Deep Learning Examples](https://github.com/NVIDIA/DeepLearningExamples):
1. 3D UNet for Medical Image Segmentation using TensorFlow 1.x
- [Git repository](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Segmentation/UNet_3D_Medical)
- Uses TensorFlow 20.06-tf1-py3 [NGC container](https://ngc.nvidia.com/registry/nvidia-tensorflow)
2. 2D UNet for Industrial Defect Segmentation using TensorFlow 1.x
- [Git repository](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Segmentation/UNet_Industrial)
- Uses TensorFlow 20.06-tf1-py3 [NGC container](https://ngc.nvidia.com/registry/nvidia-tensorflow)
3. MaskRCNN for Common Objects Segmentation using PyTorch
- [Git repository](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Segmentation/MaskRCNN)
- Uses PyTorch 20.06-py3 [NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)


View file

@ -0,0 +1,58 @@
# Machine Translation
Machine Translation is the task of translating text from one language to another. Simply replacing each word with its equivalent in another language rarely produces a semantically meaningful translation, because that may not account for phrase-level meaning at all. A good machine translation system may need to model whole sentences or phrases. The use of Neural Networks has enabled end-to-end architectures that can accomplish this, mapping from input text to the corresponding output text. A good model should handle challenges like morphologically rich languages and very large vocabularies well, while maintaining reasonable training and inference times. This collection contains state-of-the-art models and containers that can help with the task of Machine Translation.
In this collection, we will cover:
- Challenges in Machine Translation
- Model architecture
- Where to get started
---
## Challenges in Machine Translation
Not so long ago, translating text from an unfamiliar language was very time consuming. Translating word for word with a simple dictionary was challenging for two reasons: 1) the reader had to know the grammar rules of both languages, and 2) had to keep every intermediate translation in mind while working through the whole sentence.
Presently, we don't need to struggle so much: we can translate phrases, sentences, and even large texts just by putting them into Google Translate.
But if Google Translate simply tried to store a translation for every possible sentence, it wouldn't work: even for short sentences, the number of possible variations is massive. A seemingly more useful approach is to teach the machine a set of grammar rules and translate according to them. If only it were as easy as it sounds.
If you have ever tried learning a foreign language, you know that there are always many exceptions to the rules. When we try to capture all of these rules, limitations, and exceptions in a program, the quality of translation breaks down.
---
## Model architecture
i) Google's Neural Machine Translation (GNMT):
Sequence-to-Sequence (seq2seq) models are used for several Natural Language Processing (NLP) tasks, such as text summarization, speech recognition, and nucleotide sequence modeling. In machine translation, we aim to translate sentences from one language to another.
Both the input and the output are sentences; in other words, each is a sequence of words flowing into and out of the network. This is the fundamental idea of sequence-to-sequence modeling. The figure below illustrates this technique.
![Basic Architecture](img/6_machine-translation-figure-1.png)
Source - https://developer.nvidia.com/blog/introduction-neural-machine-translation-with-gpus/
The GNMT v2 model is similar to the one described in the [Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation](https://arxiv.org/abs/1609.08144) paper.
The most important difference between the two models lies in the attention mechanism. In the version 2 (v2) model, the output of the decoder's first LSTM layer goes into the attention module; the resulting context vector is then concatenated with the inputs to all subsequent LSTM layers in the decoder at the current time step.
![Basic Architecture](img/6_machine-translation-figure-2.png)
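Below is a minimal sketch of this attention step: the decoder's hidden state attends over the encoder outputs, and the resulting context vector is concatenated with the decoder state before the subsequent layers. It uses simple dot-product attention for brevity, whereas GNMT itself uses a normalized additive (Bahdanau-style) attention; the tensor shapes are made up for illustration.

```python
# Dot-product attention over encoder outputs, producing a context vector.
import torch
import torch.nn.functional as F

batch, src_len, hidden = 2, 7, 64
encoder_outputs = torch.randn(batch, src_len, hidden)   # one vector per source token
decoder_state = torch.randn(batch, hidden)               # output of the first decoder LSTM layer

# Attention scores: similarity between the decoder state and each encoder output.
scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(2)).squeeze(2)  # (batch, src_len)
weights = F.softmax(scores, dim=1)

# Context vector: weighted sum of the encoder outputs.
context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)       # (batch, hidden)

# Concatenate with the decoder state before feeding the subsequent LSTM layers.
next_layer_input = torch.cat([decoder_state, context], dim=1)               # (batch, 2 * hidden)
print(next_layer_input.shape)
```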
ii) Transformer-based Neural Machine Translation:
The Transformer model uses the standard NMT encoder-decoder architecture. Unlike other NMT models, it uses no recurrent connections and operates on a fixed-size context window. The encoder stack is made up of N identical layers, and each layer is composed of the following sublayers: 1. a self-attention layer, and 2. a feedforward network (two fully connected layers). Like the encoder stack, the decoder stack consists of N identical layers, each composed of the sublayers: 1. a self-attention layer, 2. a multi-headed attention layer that combines the encoder outputs with the outputs of the previous self-attention layer, and 3. a feedforward network (two fully connected layers).
The encoder uses self-attention to compute a representation of the input sequence. The decoder generates the output sequence one token at a time, taking the encoder output and the previously generated decoder tokens as inputs. The model applies embeddings to the input and output tokens and adds a fixed positional encoding, which injects information about the position of each token in the sequence.
![Basic Architecture](img/6_machine-translation-figure-3.png)
Source - [Attention is all you Need](https://arxiv.org/abs/1706.03762)
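The following bare-bones sketch wires up this encoder-decoder structure with `torch.nn.Transformer` (PyTorch >= 1.9 assumed for `batch_first`). The vocabulary sizes, model dimensions, and dummy batch are made up, and a real NMT system additionally needs tokenization, padding masks, a training loop, and beam-search decoding.

```python
# Encoder-decoder Transformer sketch with embeddings, sinusoidal positional
# encoding, a causal target mask, and a projection to vocabulary logits.
import math
import torch
import torch.nn as nn

src_vocab, tgt_vocab, d_model = 1000, 1000, 128

def positional_encoding(max_len, dim):
    """Fixed sinusoidal positional encoding as in the original Transformer paper."""
    pos = torch.arange(max_len).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))
    pe = torch.zeros(max_len, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

src_embed = nn.Embedding(src_vocab, d_model)
tgt_embed = nn.Embedding(tgt_vocab, d_model)
transformer = nn.Transformer(d_model=d_model, nhead=8,
                             num_encoder_layers=2, num_decoder_layers=2,
                             batch_first=True)
generator = nn.Linear(d_model, tgt_vocab)   # projects decoder output to vocabulary logits

src = torch.randint(0, src_vocab, (4, 10))  # batch of source token ids
tgt = torch.randint(0, tgt_vocab, (4, 9))   # shifted target token ids
pe = positional_encoding(32, d_model)

src_x = src_embed(src) + pe[: src.size(1)]
tgt_x = tgt_embed(tgt) + pe[: tgt.size(1)]
causal_mask = transformer.generate_square_subsequent_mask(tgt.size(1))

out = transformer(src_x, tgt_x, tgt_mask=causal_mask)
logits = generator(out)                     # (batch, tgt_len, tgt_vocab)
print(logits.shape)
```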
---
## Where to get started
NVIDIA provides Deep Learning Examples for Machine Translation on its GitHub repository. These examples provide you with easy-to-consume, highly optimized scripts for both training and inference. The quick start guide at our GitHub repository will help you set up the environment using NGC Docker images, download pre-trained models from NGC, and adapt the model training and inference to your application or use case.
Here are the examples relevant for machine translation, directly from [Deep Learning Examples](https://github.com/NVIDIA/DeepLearningExamples):
1. Machine translation with GNMT using TensorFlow
- [Git repository](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Translation/GNMT)
- Uses TensorFlow 20.06-tf1-py3 [NGC container](https://ngc.nvidia.com/registry/nvidia-tensorflow)
2. Machine translation with Transformers using PyTorch
- [Git repository](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Translation/Transformer)
- Uses PyTorch 20.03-py3 [NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)

