activeloopai / Hub

Licence: mpl-2.0
Dataset format for AI. Build, manage, & visualize datasets for deep learning. Stream data real-time to PyTorch/TensorFlow & version-control it. https://activeloop.ai

Dataset Format for AI

Documentation • Getting Started • API Reference • Examples • Blog • Slack Community • Twitter

About Hub

Hub is a dataset format with a simple API for creating, storing, and collaborating on AI datasets of any size. The Hub data layout enables rapid transformations and streaming of data while training models at scale. Hub is used by Google, Waymo, Red Cross, Oxford University, and Omdena.

Hub includes the following features:

  • Storage agnostic API: Use the same API to upload, download, and stream datasets to/from AWS S3/S3-compatible storage, GCP, Activeloop cloud, local storage, as well as in-memory.
  • Compressed storage: Store images, audio, and video in their native compression, decompressing them only when needed, for example when training a model.
  • Lazy NumPy-like slicing: Treat your S3 or GCP datasets as if they are a collection of NumPy arrays in your system's memory. Slice them, index them, or iterate through them. Only the bytes you ask for will be downloaded!
  • Dataset version control: Commits, branches, and checkouts. Concepts you already know from your code repositories can now be applied to your datasets as well.
  • Third-party integrations: Hub comes with built-in integrations for PyTorch and TensorFlow. Train your model with a few lines of code - we even take care of dataset shuffling. :)
  • Distributed transforms: Rapidly apply transformations on your datasets using multi-threading, multi-processing, or our built-in Ray integration.
  • Instant visualization support: Hub datasets are instantly visualized with bounding boxes, masks, annotations, etc. in Activeloop Platform (see below).
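
The lazy slicing idea from the list above can be illustrated without Hub itself. The sketch below is a conceptual toy, not Hub's implementation: `LazyChunkedArray`, `fetch_chunk`, and `fake_remote_fetch` are hypothetical names, and the "remote storage" is just a function that records which chunks were requested.

```python
from typing import Callable, List


class LazyChunkedArray:
    """Toy 1-D array whose values live in fixed-size remote chunks."""

    def __init__(self, length: int, chunk_size: int,
                 fetch_chunk: Callable[[int], List[int]]):
        self.length = length
        self.chunk_size = chunk_size
        self.fetch_chunk = fetch_chunk  # hypothetical remote read

    def __len__(self) -> int:
        return self.length

    def __getitem__(self, key):
        if isinstance(key, int):
            if key < 0:
                key += self.length
            if not 0 <= key < self.length:
                raise IndexError(key)
            chunk = self.fetch_chunk(key // self.chunk_size)
            return chunk[key % self.chunk_size]
        # Slice: fetch each chunk touched by the slice at most once.
        start, stop, step = key.indices(self.length)
        cache = {}
        values = []
        for i in range(start, stop, step):
            chunk_id = i // self.chunk_size
            if chunk_id not in cache:
                cache[chunk_id] = self.fetch_chunk(chunk_id)
            values.append(cache[chunk_id][i % self.chunk_size])
        return values


fetched = []  # records which chunks were "downloaded"


def fake_remote_fetch(chunk_id: int) -> List[int]:
    fetched.append(chunk_id)
    first = chunk_id * 100
    return list(range(first, first + 100))


arr = LazyChunkedArray(length=1000, chunk_size=100,
                       fetch_chunk=fake_remote_fetch)
window = arr[250:260]
print(window)
print("chunks fetched:", fetched)
```

Only the single chunk that contains indices 250 to 259 is requested. The other nine chunks are never touched.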

Getting Started with Hub

🚀 How to install Hub

Hub is written in 100% Python and can be quickly installed using pip.

pip3 install hub
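
To confirm that the install succeeded, one option is to query the installed distribution with the standard library. This is a minimal sketch; `installed_version` is a hypothetical helper, and it assumes the distribution is named `hub`, as in the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("hub")
print(f"hub {version}" if version else "hub is not installed")
```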

🧠 Training a PyTorch model on a Hub dataset

Load CIFAR-10, one of the readily available datasets in Hub:

import hub
import torch
from torchvision import transforms, models

ds = hub.load('hub://activeloop/cifar10-train')

Inspect tensors in the dataset:

ds.tensors.keys()    # dict_keys(['images', 'labels'])
ds.labels[0].numpy() # array([6], dtype=uint32)

Train a PyTorch model on the CIFAR-10 dataset without the need to download it

First, define a transform for the images and use Hub's built-in PyTorch one-line dataloader to connect the data to the compute:

tform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
])

hub_loader = ds.pytorch(num_workers=0, batch_size=4, transform={
                        'images': tform, 'labels': None}, shuffle=True)
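
To see what the transform does to pixel values, the arithmetic can be worked through in plain Python. `ToTensor` scales 8-bit pixel values to the range [0, 1], and `Normalize` computes `(value - mean) / std` per channel. The helper names below are hypothetical and stand in for the torchvision transforms.

```python
def to_unit_range(pixel: int) -> float:
    """Mimic ToTensor's scaling of an 8-bit pixel value to [0, 1]."""
    return pixel / 255.0


def normalize(value: float, mean: float = 0.5, std: float = 0.5) -> float:
    """Mimic Normalize for a single channel value."""
    return (value - mean) / std


for pixel in (0, 128, 255):
    print(pixel, normalize(to_unit_range(pixel)))
```

With a mean and standard deviation of 0.5, pixel value 0 maps to -1.0 and 255 maps to 1.0, so the network sees inputs in the range [-1, 1].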

Next, define the model, loss and optimizer:

net = models.resnet18(pretrained=False)
net.fc = torch.nn.Linear(net.fc.in_features, len(ds.labels.info.class_names))
    
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

Finally, the training loop for 2 epochs:

for epoch in range(2):
    running_loss = 0.0
    for i, data in enumerate(hub_loader):
        images, labels = data['images'], data['labels']
        
        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(images)
        loss = criterion(outputs, labels.reshape(-1))
        loss.backward()
        optimizer.step()
        
        # print statistics
        running_loss += loss.item()
        if i % 100 == 99:    # print every 100 mini-batches
            print('[%d, %5d] loss: %.3f' %
                (epoch + 1, i + 1, running_loss / 100))
            running_loss = 0.0

🏗️ How to create a Hub Dataset

A Hub dataset can be created in various locations (storage providers). This is what the path for each of them looks like:

| Storage provider       | Example path                   |
|------------------------|--------------------------------|
| Activeloop cloud       | hub://user_name/dataset_name   |
| AWS S3 / S3 compatible | s3://bucket_name/dataset_name  |
| GCP                    | gcp://bucket_name/dataset_name |
| Local storage          | path to local directory        |
| In-memory              | mem://dataset_name             |
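
The path prefix is what selects the storage provider. The mapping can be sketched as a small lookup. `storage_provider` is a hypothetical helper written for illustration and is not part of the Hub API.

```python
# Prefixes taken from the storage-provider table.
PROVIDER_BY_PREFIX = {
    "hub://": "Activeloop cloud",
    "s3://": "AWS S3 / S3 compatible",
    "gcp://": "GCP",
    "mem://": "In-memory",
}


def storage_provider(path: str) -> str:
    """Return the storage provider implied by a dataset path."""
    for prefix, provider in PROVIDER_BY_PREFIX.items():
        if path.startswith(prefix):
            return provider
    # Anything without a known prefix is treated as a local directory.
    return "Local storage"


for example in ("hub://user_name/dataset_name",
                "s3://bucket_name/dataset_name",
                "./datasets/dataset_name"):
    print(example, "->", storage_provider(example))
```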

Let's create a dataset in the Activeloop cloud. Activeloop cloud provides free storage up to 300 GB per user (more info here). Create a new account with Hub from the terminal using activeloop register if you haven't already. You will be asked for a user name, email ID, and password. The user name you enter here will be used in the dataset path.

$ activeloop register
Enter your details. Your password must be at least 6 characters long.
Username:
Email:
Password:

Initialize an empty dataset in the Activeloop Cloud:

import hub

ds = hub.empty("hub://<USERNAME>/test-dataset")

Next, create a tensor to hold images in the dataset we just initialized:

images = ds.create_tensor("images", htype="image", sample_compression="jpg")

Assuming you have a list of image file paths, let's upload them to the dataset:

image_paths = ...
with ds:
    for image_path in image_paths:
        image = hub.read(image_path)
        ds.images.append(image)
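
The `image_paths` list is left unspecified above. One way to build such a list, assuming the images sit in a local folder, is sketched below. `collect_image_paths`, the extension set, and the folder name in the usage note are hypothetical.

```python
from pathlib import Path
from typing import List

# Assumed set of image extensions; adjust to match your data.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def collect_image_paths(folder: str) -> List[str]:
    """Recursively collect image file paths under `folder`, sorted by name."""
    root = Path(folder)
    return sorted(
        str(path)
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )
```

A call such as `image_paths = collect_image_paths("path/to/images")` would then produce the list that the upload loop iterates over.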

Alternatively, you can also upload NumPy arrays. Since the images tensor was created with sample_compression="jpg", the arrays will be compressed with JPEG compression.

import numpy as np

with ds:
    for _ in range(1000):  # 1000 random images
        random_image = np.random.randint(0, 256, (100, 100, 3), dtype=np.uint8)  # 100x100 8-bit image with 3 channels
        ds.images.append(random_image)

🚀 How to load a Hub Dataset

You can load the dataset you just created with a single line of code:

import hub

ds = hub.load("hub://<USERNAME>/test-dataset")

You can also access other publicly available Hub datasets, not just the ones you created. Here is how you would load the Objectron Bikes Dataset:

import hub

ds = hub.load('hub://activeloop/objectron_bike_train')

To get the first image in the Objectron Bikes dataset as a NumPy array:

image_arr = ds.image[0].numpy()

📚 Documentation

Getting started guides, examples, tutorials, API reference, and other useful information can be found on our documentation page.

🎓 For Students and Educators

Hub users can access and visualize a variety of popular datasets through a free integration with Activeloop's Platform. Users can also create and store their own datasets and make them available to the public. Free storage of up to 300 GB is available for students and educators:

Storage for public datasets hosted by Activeloop: 200 GB free
Storage for private datasets hosted by Activeloop: 100 GB free

👩‍💻 Comparisons to Familiar Tools

Hub vs DVC

Hub and DVC both offer dataset version control similar to Git for data, but their methods for storing data differ significantly. Hub converts and stores data as chunked compressed arrays, which enables rapid streaming to ML models, whereas DVC operates on top of data stored in less efficient traditional file structures. When a dataset is composed of many files (e.g., many images), the Hub format makes versioning significantly easier than the traditional file structures used by DVC. An additional distinction is that DVC primarily uses a command-line interface, whereas Hub is a Python package. Lastly, Hub offers an API to easily connect datasets to ML frameworks and other common ML tools, and it enables instant dataset visualization through Activeloop's visualization tool.
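
The general idea of chunked compressed storage can be sketched with the standard library. This is a conceptual illustration only and does not reproduce Hub's actual on-disk format: `pack_chunks`, `read_sample`, the 4-byte length prefix, and the use of zlib are all assumptions made for the example.

```python
import zlib
from typing import List


def pack_chunks(samples: List[bytes], chunk_size: int) -> List[bytes]:
    """Group samples into chunks of `chunk_size` and compress each chunk."""
    chunks = []
    for start in range(0, len(samples), chunk_size):
        group = samples[start:start + chunk_size]
        # Prefix each sample with its length (4 bytes, big-endian).
        payload = b"".join(len(s).to_bytes(4, "big") + s for s in group)
        chunks.append(zlib.compress(payload))
    return chunks


def read_sample(chunks: List[bytes], chunk_size: int, index: int) -> bytes:
    """Decompress only the chunk that holds sample `index` and extract it."""
    payload = zlib.decompress(chunks[index // chunk_size])
    offset = 0
    # Skip over the samples that precede the requested one in this chunk.
    for _ in range(index % chunk_size):
        size = int.from_bytes(payload[offset:offset + 4], "big")
        offset += 4 + size
    size = int.from_bytes(payload[offset:offset + 4], "big")
    return payload[offset + 4:offset + 4 + size]


samples = [f"sample-{i}".encode() * 10 for i in range(1000)]
chunks = pack_chunks(samples, chunk_size=100)
print(len(chunks), "chunks instead of", len(samples), "separate files")
```

Reading one sample touches a single chunk, while the storage backend handles a small number of larger objects rather than one file per sample.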

Activeloop Hub vs TensorFlow Datasets (TFDS)

Hub and TFDS seamlessly connect popular datasets to ML frameworks. Hub datasets are compatible with both PyTorch and TensorFlow, whereas TFDS is only compatible with TensorFlow. A key difference between Hub and TFDS is that Hub datasets are designed for streaming from the cloud, whereas TFDS datasets must be downloaded locally prior to use. As a result, with Hub, one can import datasets directly from TensorFlow Datasets and stream them to either PyTorch or TensorFlow. In addition to providing access to popular publicly available datasets, Hub also offers powerful tools for creating custom datasets, storing them on a variety of cloud storage providers, and collaborating with others via a simple API. TFDS is primarily focused on giving the public easy access to commonly available datasets, and management of custom datasets is not its primary focus. A full comparison article can be found here.

Activeloop Hub vs HuggingFace

Hub and HuggingFace both offer access to popular datasets, but Hub primarily focuses on computer vision, whereas HuggingFace focuses on natural language processing. HuggingFace Transformers and other computational tools for NLP are not analogous to features offered by Hub.

Community

Join our Slack community to learn more about unstructured dataset management using Hub and to get help from the Activeloop team and other users.

We'd love your feedback: please complete our 3-minute survey.

As always, thanks to our amazing contributors!

Made with contributors-img.

Please read CONTRIBUTING.md to get started with making contributions to Hub.

README Badge

Using Hub? Add a README badge to let everyone know:

[![hub](https://img.shields.io/badge/powered%20by-hub%20-ff5a1f.svg)](https://github.com/activeloopai/Hub)

Disclaimers

Dataset Licenses

Hub users may have access to a variety of publicly available datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have a license to use the datasets. It is your responsibility to determine whether you have permission to use the datasets under their license.

If you're a dataset owner and do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thank you for your contribution to the ML community!

Usage Tracking

By default, we collect usage data using Bugout (here's the code that does it). It does not collect user data other than anonymized IP address data, and it only logs the Hub library's own actions. This helps our team understand how the tool is used and how to build features that matter to you! After you register with Activeloop, data is no longer anonymous. You can always opt out of reporting using the CLI command below:

activeloop reporting --off

Acknowledgment

This technology was inspired by our research work at Princeton University. We would like to thank William Silversmith @SeungLab for his awesome cloud-volume tool.
