IDT - Image Dataset Tool

Creator & Maintainer • Python CLI Tool

Python • CLI • Web Scraping • Image Processing • Deep Learning • Open Source

Overview

IDT (Image Dataset Tool) is a command-line tool that turns the otherwise repetitive and slow task of creating image datasets for deep learning into a fast and intuitive process. It streamlines the entire workflow from image collection to dataset preparation, reducing the time and effort required to build high-quality training datasets.

With IDT, users can quickly scrape images from multiple search engines, optimize them for machine learning, remove duplicates automatically, and split datasets into training and validation sets, all through a simple command-line interface. A sample dataset of 23,688 image files created with IDT occupies only 559.2 megabytes, about 24 KB per image on average, which illustrates how compact the optimized output is.

This project has gained significant traction in the open-source community, with over 231 stars on GitHub and 27 forks, showcasing its value to developers and researchers working with image datasets for deep learning applications.

Key Features

Multi-Engine Image Scraping

Scrape images from multiple search engines including DuckDuckGo, Bing, Bing API, Flickr API, and DeviantArt. Choose the best engine for your needs or combine results from multiple sources for comprehensive dataset coverage.

Smart Image Optimization

Automatically optimize images for deep learning with three resize methods: longer side, shorter side, and SmartCrop, an intelligent cropping algorithm based on SmartCrop.js that focuses on the main subject of each image.
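As a rough illustration of the longer-side and shorter-side methods, the sketch below computes the output dimensions for a given target size while preserving aspect ratio. The function name `resize_dimensions` and its parameters are hypothetical and not taken from IDT's code; SmartCrop is not covered here.

```python
def resize_dimensions(width, height, target, method="longer_side"):
    """Return (new_width, new_height) scaled so the chosen side equals target.

    Hypothetical helper for illustration; not IDT's actual implementation.
    """
    if width <= 0 or height <= 0 or target <= 0:
        raise ValueError("width, height and target must be positive")

    if method == "longer_side":
        base = max(width, height)   # the longer side becomes `target`
    elif method == "shorter_side":
        base = min(width, height)   # the shorter side becomes `target`
    else:
        raise ValueError(f"unknown resize method: {method!r}")

    # Scale both sides by target / base, keeping the aspect ratio.
    new_width = max(1, round(width * target / base))
    new_height = max(1, round(height * target / base))
    return new_width, new_height


# A 1024x768 image with a target of 512:
print(resize_dimensions(1024, 768, 512, "longer_side"))   # (512, 384)
print(resize_dimensions(1024, 768, 512, "shorter_side"))  # (683, 512)
```

The longer-side method guarantees that neither dimension exceeds the target, while the shorter-side method guarantees that neither dimension falls below it.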

Automatic Duplicate Removal

Built-in duplicate detection and removal ensures your dataset contains only unique images, improving dataset quality and reducing storage requirements without manual intervention.
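One simple way to approach duplicate removal is to hash each file's bytes and keep only the first file seen for each hash. The sketch below assumes that approach; how IDT actually detects duplicates is not described here, and `remove_exact_duplicates` is a hypothetical name.

```python
import hashlib
from pathlib import Path


def remove_exact_duplicates(folder):
    """Delete byte-identical files in `folder`, keeping the first of each.

    Illustrative sketch only: it catches exact copies, not resized or
    re-encoded versions of the same picture.
    """
    seen = {}      # content hash -> first path with that content
    removed = []   # paths that were deleted as duplicates

    # Sort for a deterministic choice of which copy is kept.
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            path.unlink()
            removed.append(path)
        else:
            seen[digest] = path
    return removed
```

Catching near-duplicates, such as the same image at a different resolution, would need a perceptual hash or a similar technique instead of a cryptographic one.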

Dataset Splitting

Automatically split your dataset into training and validation folders with customizable proportions. Essential for deep learning workflows, this feature saves hours of manual organization.
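A split with a customizable proportion can be sketched as a seeded shuffle followed by a slice. The function `split_dataset` and its `validation_fraction` and `seed` parameters are assumptions made for this example, not IDT's interface; moving the files into training and validation folders would be a separate step.

```python
import random


def split_dataset(items, validation_fraction=0.2, seed=0):
    """Return (train, validation) lists drawn from `items`.

    Hypothetical helper: shuffles with a fixed seed so the split is
    reproducible, then takes the first share as the validation set.
    """
    if not 0.0 <= validation_fraction <= 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")

    shuffled = list(items)                  # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)

    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


files = [f"img_{i}.jpg" for i in range(10)]
train, validation = split_dataset(files, validation_fraction=0.2)
print(len(train), len(validation))  # 8 2
```

In a multi-class dataset the same function would typically be applied per class, so that every class keeps the requested proportion.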

Flexible Configuration

Interactive configuration wizard guides you through dataset setup. Define multiple classes, keywords per class, image sizes, and search parameters through an intuitive command-line interface.

Dataset Statistics

Automatically generates CSV files with comprehensive dataset statistics, including image counts per class, total dataset size, and other metadata to help you understand your dataset composition.
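To give a sense of what such a statistics file might contain, the sketch below writes per-class image counts and a total row using Python's standard `csv` module. The column names and the `write_class_stats` function are hypothetical; the actual layout of IDT's CSV output is not specified here.

```python
import csv
import io


def write_class_stats(class_counts, out):
    """Write one row per class plus a total row to the text stream `out`.

    `class_counts` maps class name -> number of images. The column names
    are made up for this example.
    """
    writer = csv.writer(out)
    writer.writerow(["class", "image_count"])
    for name, count in sorted(class_counts.items()):
        writer.writerow([name, count])
    writer.writerow(["TOTAL", sum(class_counts.values())])


buffer = io.StringIO()
write_class_stats({"dogs": 2, "cats": 3}, buffer)
print(buffer.getvalue())
```

Passing an open file (created with `newline=""`, as the `csv` documentation recommends) instead of a `StringIO` buffer would write the same rows to disk.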

My Role & Contribution

Full Project Ownership

I conceived, designed, implemented, tested, and released IDT as a complete end-to-end project. From the initial idea of simplifying dataset creation for deep learning to the final published tool, every aspect of the project was developed independently.

Conception & Design

Identified the pain points in creating image datasets for machine learning and designed a solution that combines web scraping, image processing, and dataset management into a single, intuitive CLI tool. The architecture supports multiple search engines and flexible configuration options.

Implementation

Built the entire tool in Python, implementing web scraping capabilities for multiple search engines, image processing algorithms including SmartCrop integration, duplicate detection, and dataset splitting functionality. The codebase is well structured and maintainable, and it follows Python best practices.

Testing & Quality Assurance

Thoroughly tested all features including image scraping from different sources, various resize methods, duplicate detection accuracy, and dataset splitting functionality. Ensured the tool works reliably across different operating systems and Python versions.

Release & Open Source

Published the tool as an open-source project on GitHub with comprehensive documentation, installation instructions, usage examples, and contribution guidelines. The project is available via PyPI for easy installation using pip, making it accessible to the broader machine learning community.

Community & Maintenance

Actively maintain the project, responding to issues, reviewing pull requests, and releasing updates. The project has gained significant community support with over 231 GitHub stars and continues to grow as developers discover its value for their machine learning workflows.

Technical Highlights

Search Engine Integration

Implemented scrapers for multiple search engines with different APIs and authentication methods:

DuckDuckGo (default, no API key required)
Bing (web scraping)
Bing API (official API integration)
Flickr API (with API key support)
DeviantArt

Image Processing Algorithms

Advanced image processing capabilities:

Longer side resize method
Shorter side resize method
SmartCrop (subject-focused cropping)
Automatic duplicate detection
Image compression and optimization

CLI Design

User-friendly command-line interface with interactive configuration wizard. Commands include:

idt run
idt init
idt build
idt split
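A command layout like this maps naturally onto subcommands. The sketch below registers the four command names with the standard library's `argparse`; it is an assumption-laden illustration, since the library IDT uses for its CLI and the options each command accepts are not stated here.

```python
import argparse

# The four command names come from the list above; everything else
# (parser structure, the `command` attribute) is illustrative.
COMMANDS = ("run", "init", "build", "split")


def build_parser():
    """Build a parser for `idt <command>` with one subparser per command."""
    parser = argparse.ArgumentParser(prog="idt")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name in COMMANDS:
        subparsers.add_parser(name)
    return parser


args = build_parser().parse_args(["init"])
print(args.command)  # init
```

Each subparser could then carry its own options, and a dispatch table keyed on `args.command` would route to the matching handler.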

Dataset Management

Comprehensive dataset organization features including YAML configuration files, automatic folder structure creation, CSV statistics generation, and train/validation splitting with customizable proportions.

Project Impact

231+

GitHub Stars

27

Forks

MIT

Open Source License

Learn More

View on GitHub • Install via PyPI