
SageMaker Built-in Algorithms
Amazon SageMaker offers a wide range of built-in algorithms to simplify and accelerate machine learning (ML) projects. These algorithms are optimized for performance and scalability, making them suitable for various use cases. This overview groups them by learning paradigm (supervised and unsupervised) and by data type (images and video, time series, text, and speech).
Supervised Learning
In supervised learning, algorithms are trained on labeled data, where the desired solutions (labels) are included in the training data. SageMaker provides several general-purpose supervised learning algorithms that can be used for classification or regression problems.
- Linear Learner: Handles both classification and regression by learning a linear function for regression or a linear threshold function for classification. It is particularly effective for large, high-dimensional datasets.
- Factorization Machines: An extension of a linear model designed to capture interactions between features in high-dimensional sparse datasets, which makes it a good fit for recommendation systems.
- XGBoost: An implementation of the gradient-boosted trees algorithm, which combines an ensemble of estimates from a set of simpler, weaker models. It is widely used for classification and regression and is known for its accuracy and efficiency.
- K-Nearest Neighbors (k-NN): A non-parametric method that predicts from the k nearest labeled points in the feature space. For classification, it assigns the most frequent label among those neighbors; for regression, it predicts the average of their target values.
- AutoGluon-Tabular: An open-source AutoML framework that succeeds by ensembling models and stacking them in multiple layers.
- CatBoost: An implementation of the gradient-boosted trees algorithm that introduces ordered boosting and an innovative algorithm for processing categorical features.
- LightGBM: An implementation of the gradient-boosted trees algorithm that adds two novel techniques for improved efficiency and scalability: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB).
- TabTransformer: A deep tabular data modeling architecture built on self-attention-based Transformers.
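The k-NN rule described above (most frequent label for classification, average target for regression) can be sketched in a few lines of plain Python. This is a brute-force illustration under simple assumptions (Euclidean distance, tiny made-up datasets), not how the SageMaker algorithm is implemented; the function names are hypothetical.

```python
from collections import Counter
import math


def k_nearest(train, query, k):
    """Return the k (features, target) pairs closest to query by Euclidean distance."""
    return sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]


def knn_classify(train, query, k=3):
    """Majority vote over the labels of the k nearest points."""
    labels = [label for _, label in k_nearest(train, query, k)]
    return Counter(labels).most_common(1)[0][0]


def knn_regress(train, query, k=3):
    """Mean of the target values of the k nearest points."""
    targets = [target for _, target in k_nearest(train, query, k)]
    return sum(targets) / len(targets)


# Made-up example data: (features, label) and (features, numeric target).
labeled = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
           ((5.0, 5.0), "b"), ((5.2, 4.9), "b")]
valued = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]

predicted_label = knn_classify(labeled, (1.1, 1.0), k=3)
predicted_value = knn_regress(valued, (2.1,), k=3)
```

Sorting every training point per query is fine for a toy example but scales poorly, which is one reason production implementations rely on indexing rather than brute force.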
Unsupervised Learning
Unsupervised learning algorithms process unlabeled data to discover patterns or relationships within the data. These algorithms are used for tasks such as clustering, dimensionality reduction, anomaly detection, and pattern recognition.
Clustering:
- K-Means: A widely used clustering algorithm that partitions data points into discrete groups, where members of a group are as similar as possible to one another and as different as possible from members of other groups.
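As a rough illustration of the idea, the textbook version of k-means (Lloyd's algorithm) alternates between assigning points to their nearest centroid and moving each centroid to the mean of its members. The sketch below assumes hand-picked initial centroids and a fixed iteration count; it is not the variant SageMaker runs, and the names are hypothetical.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm with caller-supplied initial centroids."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        assignments = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Made-up 2-D points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

The result depends on the starting centroids, so practical implementations choose them with more care than this example does.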
Topic Modeling:
- Latent Dirichlet Allocation (LDA): A probabilistic topic modeling algorithm that uncovers hidden topics within a collection of documents, describing each document as a mixture of topics.
Embeddings:
- Object2Vec: An unsupervised learning algorithm that compares pairs of data points and learns embeddings that preserve the semantics of the relationship between each pair. Downstream algorithms can use these embeddings for tasks such as product search, item matching, recommendation, and information retrieval.
Anomaly Detection:
- Random Cut Forest (RCF): An unsupervised algorithm that identifies anomalies by assigning an anomaly score to each data point. It works efficiently on streaming data and is particularly useful for applications like fraud detection and network security.
- IP Insights: An anomaly detection algorithm that learns the usage patterns for IPv4 addresses in order to detect problems and threats in an IP network, such as potentially malicious IP addresses.
Dimensionality Reduction:
- Principal Component Analysis (PCA): A widely used unsupervised technique that reduces the number of features in data while retaining as much information as possible.
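To make "fewer features, most of the information" concrete, the sketch below reduces two correlated features to one coordinate along the direction of greatest variance. It estimates that direction with power iteration on the sample covariance matrix, which is a simple stand-in chosen to avoid external libraries, not the method SageMaker uses; the data and function names are made up.

```python
import math


def first_principal_component(rows, iterations=200):
    """Estimate the top principal direction with power iteration on the covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d  # arbitrary starting vector
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v


def project(rows, means, direction):
    """Reduce each row to a single coordinate along the given direction."""
    return [sum((r[j] - means[j]) * direction[j] for j in range(len(means)))
            for r in rows]


# Two strongly correlated features: almost all variance lies along the diagonal.
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1)]
means, direction = first_principal_component(data)
reduced = project(data, means, direction)
```

Here `reduced` holds one value per row instead of two, and because the two features move together, little information is lost by dropping the second dimension.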
Image/Videos
SageMaker provides algorithms and pre-trained models for various image and video processing tasks.
- Image Classification: Built on the ResNet architecture for image categorization. SageMaker provides pre-trained ResNet models for image classification tasks; ImageNet is the dataset such models are typically pre-trained on, not a framework.
- Object Detection: Uses the Single Shot Detector (SSD) framework to identify multiple objects in an image. As with image classification, SageMaker offers pre-trained models for object detection.
- Semantic Segmentation: Classifies images at the pixel level, assigning a label to each pixel. The available algorithms are Fully Convolutional Network (FCN) for pixel-wise classification; Pyramid Scene Parsing (PSP), which takes into account the context and relationships between objects in a scene; and DeepLab V3 with ResNet, which achieves high accuracy in image segmentation tasks.
Time Series
- DeepAR: A recurrent neural network (RNN) based algorithm for forecasting time series data. It is useful for applications like demand forecasting and financial modeling.
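As a small illustration of what DeepAR training data can look like, the sketch below writes time series in a JSON Lines layout, one series per line with a `start` timestamp and a `target` array. This assumes the field names from the DeepAR input documentation; the demand numbers and the helper name are made up.

```python
import json

# Hypothetical daily demand for two products (made-up numbers).
series = [
    {"start": "2024-01-01 00:00:00", "target": [12.0, 15.0, 11.0, 14.0]},
    {"start": "2024-01-03 00:00:00", "target": [3.0, 4.0, 2.0]},
]


def to_json_lines(records):
    """Serialize one time series per line."""
    return "\n".join(json.dumps(record) for record in records)


payload = to_json_lines(series)
```

Each series carries its own start time and length, so series that begin on different dates can share one file.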
Text
SageMaker offers algorithms tailored to the analysis of texts and documents used in natural language processing and translation.
Text Classification:
- BlazingText: A highly optimized algorithm for text classification, achieving fast training and inference speeds.
Word2Vec:
- BlazingText: BlazingText can also be used for learning word embeddings, representing words as dense vectors.
Machine Translation:
- Sequence to Sequence: A neural network architecture for machine translation, translating sequences of words from one language to another.
Topic Modeling:
- Latent Dirichlet Allocation (LDA): As mentioned earlier, LDA is also effective for topic modeling.
- Neural Topic Modeling (NTM): A deep learning based approach to topic modeling, uncovering hidden topics within text data.
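For the BlazingText text classification case mentioned above, training data is commonly prepared as one example per line, with the label prefixed by `__label__` and followed by space-separated tokens. The sketch below assumes that layout and uses naive lowercase whitespace tokenization; the helper name and the example reviews are hypothetical.

```python
def to_blazingtext_line(label, text):
    """Format one example as '__label__<label> token token ...'."""
    tokens = text.lower().split()  # naive tokenization, for illustration only
    return f"__label__{label} " + " ".join(tokens)


# Made-up product reviews with sentiment labels.
examples = [
    ("positive", "Great battery life"),
    ("negative", "Screen cracked after a week"),
]
lines = [to_blazingtext_line(label, text) for label, text in examples]
```

A real pipeline would likely use a proper tokenizer so that punctuation is separated from words before training.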
Speech
- Sequence to Sequence: This architecture can also be applied to speech recognition and generation tasks, processing sequences of audio data.