🚀 500+ curated resources for Data Analysis & Data Science: Python, SQL, Statistics, ML, AI, Visualization, Cheatsheets, Roadmaps, Interview Prep. For beginners and experts.

PavelGrigoryevDS/awesome-data-analysis

Awesome Data Analysis

500+ curated resources for data analysis and data science: tools, libraries, roadmaps, cheatsheets, and interview guides.

📖 For comfortable reading: Web version

🌱 Want to improve this list? Suggest a change here or join the Discussions.

🌟 Goal: 500 stars! Join us in making data analysis learning more accessible!

Maintained with ❤️


📑 Contents


🏆 Awesome Data Science Repositories

Curated collections of high-quality GitHub repos for inspiration and learning.

⬆ back to top


🗺️ Roadmaps

Step-by-step guides and skill trees to master data science and analytics.

⬆ back to top


🐍 Python

Resources

A collection of resources for learning and mastering Python programming.

⬆ back to top


Data Manipulation with Pandas and Numpy

Tutorials and best practices for working with Pandas and Numpy.

⬆ back to top


Useful Python Tools for Data Analysis

A collection of Python libraries for efficient data manipulation, cleaning, visualization, validation, and analysis.

Data Processing & Transformation

  • Pandas DQ - Data type correction and automatic DataFrame cleaning.
  • Vaex - High-performance Python library for lazy Out-of-Core DataFrames.
  • Polars - Multithreaded, vectorized query engine for DataFrames.
  • Fugue - Unified interface for Pandas, Spark, and Dask.
  • TheFuzz - Fuzzy string matching (Levenshtein distance).
  • DateUtil - Extensions for standard Python datetime features.
  • Arrow - Enhanced work with dates and times.
  • Pendulum - Alternative to datetime with timezone support.
  • Dask - Parallel computing for arrays and DataFrames.
  • Modin - Speeds up Pandas by distributing computations.
  • Pandarallel - Parallel operations for pandas DataFrames.
  • DataCleaner - Python tool for automatically cleaning and preparing datasets.
  • Pandas Flavor - Add custom methods to Pandas.
  • Pandas DataReader - Reads data from various online sources into pandas DataFrames.
  • Sklearn Pandas - Bridge between Pandas and Scikit-learn.
  • CuPy - A NumPy-compatible array library accelerated by NVIDIA CUDA for high-performance computing.
  • Numba - A JIT compiler that translates a subset of Python and NumPy code into fast machine code.
  • Pandas Stubs - Type stubs for pandas, improves IDE autocompletion.
  • Petl - ETL tool for data cleaning and transformation.
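
The TheFuzz entry above mentions Levenshtein distance. As a rough illustration of what that metric measures, here is a minimal pure-Python sketch. It is not TheFuzz's API and does not reproduce its scoring; the function names `levenshtein` and `similarity` are hypothetical and exist only for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute (or match)
                )
            )
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Illustrative 0..1 score: 1.0 means identical strings.
    This normalization is an assumption, not TheFuzz's formula."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


distance = levenshtein("kitten", "sitting")  # 3 edits
```

For real workloads, a library from the list is likely a better choice than this sketch, since optimized implementations are much faster on long strings.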

⬆ back to top


Automated EDA and Visualization Tools

  • AutoViz - Automatic data visualization in 1 line of code.
  • Sweetviz - Automatic EDA with dataset comparison.
  • Lux - Automatic DataFrame visualization in Jupyter.
  • YData Profiling - Data quality profiling & exploratory data analysis.
  • Missingno - Visualize missing data patterns.
  • Vizro - Low-code toolkit for building data visualization apps.
  • Yellowbrick - Visual diagnostic tools for machine learning.
  • Great Tables - Create awesome display tables using Python.
  • DataMapPlot - Create beautiful plots of data maps.
  • Datashader - Quickly and accurately render even the largest data.
  • PandasAI - Conversational data analysis using LLMs and RAG.
  • Mito - Jupyter extensions for faster code writing.
  • D-Tale - Interactive GUI for data analysis in a browser.
  • Pandasgui - GUI for viewing and filtering DataFrames.
  • PyGWalker - Interactive UIs for visual analysis of DataFrames.
  • QGrid - Interactive grid for DataFrames in Jupyter.
  • Pivottablejs - Interactive PivotTable.js tables in Jupyter.

⬆ back to top


Data Quality & Validation

  • PyOD - Outlier and anomaly detection.
  • Alibi Detect - Outlier, adversarial and drift detection.
  • Pandera - Data validation through declarative schemas.
  • Cerberus - Data validation through schemas.
  • Pydantic - Data validation using Python type annotations.
  • Dora - Automate EDA: preprocessing, feature engineering, visualization.
  • Great Expectations - Data validation and testing.

⬆ back to top


Feature Engineering & Selection

  • FeatureTools - Automated feature engineering.
  • Feature Engine - Feature engineering with Scikit-Learn compatibility.
  • Prince - Multivariate exploratory data analysis (PCA, CA, MCA).
  • Fitter - Figures out the distribution your data comes from.
  • Feature Selector - Tool for dimensionality reduction of machine learning datasets.
  • Category Encoders - Extensive collection of categorical variable encoders.
  • Imbalanced Learn - Handling imbalanced datasets.

⬆ back to top


Specialized Data Tools

  • cuDF - A GPU DataFrame library for loading, joining, and aggregating data.
  • Faker - Generates fake data for testing.
  • Mimesis - Generates realistic test data.
  • Geopy - Geocoding addresses and calculating distances.
  • PySAL - Spatial analysis functions.
  • Factor Analyzer - A Python package for factor analysis, including exploratory and confirmatory methods.
  • Scattertext - Beautiful visualizations of language differences among document types.
  • IGraph - A library for creating and manipulating graphs and networks, with bindings for multiple languages.
  • Joblib - A lightweight pipelining library for Python, particularly useful for saving and loading large NumPy arrays.
  • ImageIO - A library that provides an easy interface to read and write a wide range of image data.
  • Texthero - Text preprocessing, representation and visualization.
  • Geopandas - Geographic data operations with pandas.
  • NetworkX - Network analysis and graph theory.

⬆ back to top


🗃️ SQL & Databases

Resources

SQL tutorials and database design principles.

⬆ back to top


Tools

A collection of Python libraries and drivers for seamless database access and interaction.

  • PyODBC - Python library for ODBC database access.
  • SQLAlchemy - SQL toolkit and ORM for Python.
  • Psycopg2 - PostgreSQL database adapter.
  • MySQL Connector/Python - MySQL driver for Python.
  • PonyORM - ORM for Python with dynamic query generation.
  • PyMongo - Official MongoDB driver for Python.
  • SQLiteviz - A tool for exploring SQLite databases and visualizing the results of your queries.
  • SQLite - A C-language library that implements a small, fast, self-contained, high-reliability, full-featured SQL database engine.
  • DB Browser for SQLite - A high quality, visual, open source tool to create, design, and edit database files compatible with SQLite.
  • DBeaver - A free universal database tool and SQL client for developers, SQL programmers, and administrators.
  • Beekeeper Studio - A modern, easy-to-use SQL client and database manager with a clean, cross-platform interface.
  • SQLFluff - A modular SQL linter and auto-formatter designed to enforce consistent style and catch errors in SQL code.
  • PyMySQL - A pure-Python MySQL client library for interacting with MySQL databases from Python applications.
  • Vanna.AI - An AI-powered tool for generating SQL queries from natural language questions.
  • SQLChat - A chat-based SQL client that allows you to query databases using natural language conversations.
  • Records - SQL queries to databases via Python syntax.
  • Dataset - JSON-like interface for working with SQL databases.
  • SQLGlot - A no-dependency SQL parser, transpiler, and optimizer for Python.
  • TDengine - An open-source big data platform designed for time-series data, IoT, and industrial monitoring.
  • TimescaleDB - An open-source time-series SQL database optimized for fast ingest and complex queries.
  • DuckDB - In-process analytical database for fast SQL queries.
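
To give a feel for the kind of database access these tools provide, here is a minimal sketch using Python's built-in `sqlite3` module, which wraps the SQLite engine listed above. It is not one of the listed packages, and the `orders` table, its columns, and the `top_customers` helper are hypothetical names chosen for this example.

```python
import sqlite3


def top_customers(rows, limit=2):
    """Load (customer, amount) rows into an in-memory SQLite database
    and return the customers with the highest total spend."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
        # Parameterized inserts avoid building SQL strings by hand.
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        query = """
            SELECT customer, SUM(amount) AS total
            FROM orders
            GROUP BY customer
            ORDER BY total DESC, customer ASC
            LIMIT ?
        """
        return conn.execute(query, (limit,)).fetchall()
    finally:
        conn.close()


sample_orders = [("alice", 30.0), ("bob", 12.5), ("alice", 20.0), ("carol", 45.0)]
result = top_customers(sample_orders)
# result == [("alice", 50.0), ("carol", 45.0)]
```

The same query pattern should carry over to most of the drivers and ORMs in this list, although connection setup and placeholder syntax differ between them.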

⬆ back to top


📊 Data Visualization

Resources

Color theory, chart selection guides, and storytelling tips.

⬆ back to top


Tools

Libraries for static, interactive, and 3D visualizations.

  • Matplotlib - A comprehensive library for creating static, animated, and interactive visualizations in Python.
  • Seaborn - A statistical data visualization library based on Matplotlib.
  • Plotly - A library for creating interactive plots and dashboards.
  • Altair - A declarative statistical visualization library for Python.
  • Bokeh - A library for creating interactive visualizations for modern web browsers.
  • HoloViews - A tool for building complex visualizations easily.
  • Geopandas - An extension of Pandas for geospatial data.
  • Folium - A library for visualizing data on interactive maps.
  • Pygal - A Python SVG charting library.
  • Plotnine - A grammar of graphics for Python.
  • Bqplot - A plotting library for IPython/Jupyter notebooks.
  • PyPalettes - A large collection (2,500+) of color maps for Python.
  • Deck.gl - A WebGL-powered framework for visual exploratory data analysis of large datasets.
  • Python for Geo - Contextily: add background basemaps to your plots in GeoPandas.
  • OSMnx - A package to easily download, model, analyze, and visualize street networks from OpenStreetMap.
  • Apache ECharts - A powerful, interactive charting and visualization library for browser-based applications.
  • VisPy - A high-performance interactive 2D/3D data visualization library leveraging the power of OpenGL.
  • Glumpy - A Python library for scientific visualization that is fast, scalable and beautiful, based on OpenGL.
  • Pandas-bokeh - Bokeh plotting backend for Pandas.

⬆ back to top


📈 Dashboards & BI

Resources

Tutorials for building and enhancing dashboards and visualizations using various tools and frameworks.

⬆ back to top


Tools

Frameworks for building custom dashboard solutions.

  • Dash - Framework for creating interactive web applications.
  • Streamlit - Simplified framework for building data applications.
  • Panel - Framework for creating interactive web applications.
  • Gradio - Tool for creating and sharing machine learning applications.
  • OpenSearch Dashboards - A powerful data visualization and dashboarding tool for OpenSearch data, forked from Kibana.
  • GridStack.js - A library for building draggable, resizable responsive dashboard layouts.
  • Tremor - A React library to build dashboards fast with pre-built components for charts, KPIs, and more.
  • Appsmith - An open-source platform to build and deploy internal tools, admin panels, and CRUD apps quickly.
  • Grafanalib - A Python library for generating Grafana dashboards configuration as code.
  • H2O Wave - A Python framework for rapidly building and deploying realtime web apps and dashboards for AI and analytics.
  • Shiny for Python - Python version of the popular R Shiny framework.
  • Voilà - Turn Jupyter notebooks into standalone web applications.
  • Reflex - Full-stack Python framework for building web apps.

⬆ back to top


Software

A list of leading tools and platforms for data visualization and dashboard creation.

  • Tableau - Leading data visualization software.
  • Microsoft Power BI - Business analytics tool for visualizing data.
  • QlikView - Tool for data visualization and business intelligence.
  • Metabase - User-friendly open-source BI tool.
  • Apache Superset - Open-source data exploration and visualization platform.
  • Preset - A platform for modern business intelligence, providing a hosted version of Apache Superset.
  • Redash - Tool for visualizing and sharing data insights.
  • Grafana - Dashboarding and monitoring tool.
  • Datawrapper - User-friendly chart and map creation tool.
  • ChartBlocks - Online chart creation platform.
  • Infogram - Tool for creating infographics and visual content.
  • Google Data Studio (now Looker Studio) - Free tool for creating interactive dashboards and reports.
  • Rath - Next-generation automated data exploratory analysis and visualization platform.
  • Kibana - The official visualization and dashboarding tool for the Elastic Stack (Elasticsearch, Logstash, Beats).

⬆ back to top


🕸️ Web Scraping & Crawling

Resources

A collection of valuable resources, tutorials, and libraries for web scraping with Python.

⬆ back to top


Tools

A list of Python libraries and tools for web scraping.

  • Requests - A simple, yet elegant, HTTP library for Python.
  • BeautifulSoup - A library for parsing HTML and XML documents.
  • Selenium - A tool for automating web applications for testing purposes.
  • Scrapy - An open-source and collaborative web crawling framework for Python.
  • Browser Use - A library for browser automation and web scraping.
  • Gerapy - Distributed Crawler Management Framework based on Scrapy, Scrapyd, Django, and Vue.js.
  • AutoScraper - A smart, automatic, fast, and lightweight web scraper for Python.
  • Feedparser - A library to parse feeds in Python.
  • Trafilatura - A Python & command-line tool to gather text and metadata on the web.
  • You-Get - A tiny command-line utility to download media contents (videos, audios, images) from the web.
  • MechanicalSoup - A Python library for automating interaction with websites.
  • ScrapeGraph AI - A Python scraper based on AI.
  • Snscrape - A social networking service scraper in Python.
  • Ferret - A web scraping system that lets you declaratively describe what data to extract using a simple query language.
  • Grab - A Python framework for building web scraping apps, providing a high-level API for asynchronous requests.
  • Playwright - Python version of the Playwright browser automation library.
  • PyQuery - A jQuery-like library for parsing HTML documents in Python.
  • Helium - High-level Selenium wrapper for easier web automation.
  • Scrapling - A framework for building web scrapers and crawlers.

⬆ back to top


🔢 Mathematics

A collection of resources for learning mathematics, particularly in the context of data science and machine learning.

⬆ back to top


🎲 Statistics & Probability

Resources

A selection of resources focused on statistics and probability, including tutorials and comprehensive guides.

⬆ back to top


Tools

A collection of tools focused on statistics and probability.

  • SciPy - Fundamental library for scientific computing and statistics.
  • Statsmodels - Statistical modeling, testing, and data exploration.
  • PyMC - A probabilistic programming library for Python that allows for flexible Bayesian modeling.
  • Pingouin - Statistical package with improved usability over SciPy.
  • scikit-posthocs - Post-hoc tests for statistical analysis of data.
  • Lifelines - Survival analysis and event history analysis in Python.
  • scikit-survival - Survival analysis built on scikit-learn for time-to-event prediction.
  • Bootstrap - Bootstrap confidence interval estimation methods.
  • PyStan - Python interface to Stan for Bayesian statistical modeling.
  • ArviZ - Exploratory analysis of Bayesian models with visual diagnostics.
  • PyGAM - A Python library for generalized additive models with built-in smoothing and regularization.
  • NumPyro - A probabilistic programming library built on JAX for high-performance Bayesian modeling.
  • Causal Impact - A Python implementation of the R package for causal inference using Bayesian structural time-series models.
  • DoWhy - A Python library for causal inference that supports explicit modeling and testing of causal assumptions.
  • Patsy - A Python library for describing statistical models and building design matrices.
  • Pomegranate - Fast and flexible probabilistic modeling library for Python with GPU support.
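
Several entries above deal with resampling and interval estimation (see the Bootstrap entry). As a rough sketch of the idea, the snippet below computes a percentile bootstrap confidence interval for a mean using only the standard library. It does not use any listed package's API; the function name `bootstrap_mean_ci`, its parameters, and the sample values are assumptions made for illustration.

```python
import random
import statistics


def bootstrap_mean_ci(data, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `data`.

    Resamples `data` with replacement, records the mean of each resample,
    and returns the alpha/2 and 1 - alpha/2 empirical quantiles.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    means = sorted(
        statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical measurements, used only to exercise the function.
sample = [2.1, 2.5, 3.0, 3.2, 3.8, 4.1, 4.4, 5.0]
lower, upper = bootstrap_mean_ci(sample)
```

The percentile method is the simplest bootstrap variant. The libraries above typically offer corrected variants that tend to behave better on small or skewed samples.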

⬆ back to top


🧪 A/B Testing

A collection of resources focused on A/B testing.

⬆ back to top


⏳ Time Series Analysis

Resources

A collection of resources for understanding time series fundamentals and analytical techniques.

⬆ back to top


Tools

A collection of tools for working with temporal data.

  • Facebook Prophet - A procedure for forecasting time series data based on an additive model.
  • Uber Orbit - A Python package for Bayesian time series forecasting and inference.
  • sktime - A unified Python framework for machine learning with time series, compatible with scikit-learn.
  • GluonTS - A Python toolkit for probabilistic time series modeling, built on MXNet.
  • Time-Series-Library - A library for deep learning-based time series analysis and forecasting.
  • TimesFM - A pretrained time series foundation model from Google Research for zero-shot forecasting.
  • PyTorch Forecasting - A PyTorch-based library for time series forecasting with neural networks.
  • Time-series-prediction - A collection of time series prediction methods and implementations.
  • PlotJuggler - A tool to visualize and analyze time series data logs in real-time.
  • TSFresh - Automatically extracting features from time series data.
  • pmdarima - Python library for ARIMA modeling and time series analysis.
  • Kats - Toolkit for analyzing time series data from Facebook Research.

⬆ back to top


⚙️ Data Engineering

Resources

A collection of resources to help you build and manage robust data pipelines and infrastructure.

⬆ back to top


Tools

A collection of tools for building, deploying, and managing data pipelines and infrastructure.

  • dbt-core - A framework for transforming data in your warehouse using SQL and Jinja.
  • Apache Spark - A unified engine for large-scale data processing and analytics.
  • Apache Kafka - A distributed event streaming platform for building real-time data pipelines.
  • Dagster - A data orchestrator for machine learning, analytics, and ETL.
  • Apache Airflow - A platform to programmatically author, schedule, and monitor workflows.
  • Apache Hive - A data warehouse software for reading, writing, and managing large datasets in distributed storage using SQL.
  • Apache Hadoop - A framework that allows for the distributed processing of large data sets across clusters of computers.
  • Luigi - A Python module for building complex and batch-oriented data pipelines.
  • Apache Iceberg - A high-performance table format for huge analytic datasets.
  • Apache Cassandra - A highly scalable distributed NoSQL database designed for handling large amounts of data across many commodity servers.
  • Apache Flink - A framework for stateful computations over unbounded and bounded data streams (real-time stream processing).
  • Apache Beam - A unified model for defining both batch and streaming data-parallel processing pipelines.
  • Apache Pulsar - A cloud-native, distributed messaging and streaming platform.
  • Delta Lake - A storage layer that brings ACID transactions to Apache Spark and big data workloads.
  • Apache Hudi - An open data lakehouse platform, built on a high-performance open table format.
  • Trino - A distributed SQL query engine designed for fast analytic queries against large datasets.
  • DataHub - A metadata platform for the modern data stack.
  • OpenLineage - An open framework for collection and analysis of data lineage.
  • Kedro - A framework for creating reproducible, maintainable and modular data science code.
  • Apache Calcite - A dynamic data management framework that allows for SQL parsing, optimization, and federation.
  • Prefect - Workflow orchestration for building resilient data pipelines.
  • Apache Arrow - Universal columnar format and multi-language toolbox for fast data interchange.
  • Kestra - An open-source, event-driven orchestrator that simplifies data workflow management.

⬆ back to top


📖 Natural Language Processing (NLP)

Resources

A selection of resources for learning and applying natural language processing in Python.

⬆ back to top


Tools

A collection of powerful libraries and frameworks for natural language processing in Python.

  • Natural Language Toolkit (NLTK) - A leading platform for building Python programs to work with human language data.
  • TextBlob - A simple library for processing textual data.
  • SpaCy - An open-source software library for advanced NLP in Python.
  • BERT - A transformer-based model for NLP tasks.
  • Flair - A simple framework for state-of-the-art NLP.
  • OpenHands - A library and framework for building applications with large language models.
  • Stanford CoreNLP - A Java suite of core NLP tools providing fundamental linguistic analysis capabilities.
  • John Snow Labs Spark-NLP - A state-of-the-art Natural Language Processing library built on Apache Spark.
  • TextAttack - A Python framework for adversarial attacks, data augmentation, and model training in NLP.
  • Gensim - Topic modeling and natural language processing library for Python.
  • Stanza - Python NLP library for many human languages, from the Stanford NLP Group.
  • SentenceTransformers - Framework for state-of-the-art sentence and text embeddings.

⬆ back to top


🤖 Machine Learning & AI

Resources

A collection of resources to help you learn and apply machine learning concepts and techniques.

⬆ back to top


Tools

A collection of tools for developing and deploying machine learning models.

Machine Learning

  • Scikit-learn - Machine learning library for classical algorithms and model building.
  • XGBoost - Optimized distributed gradient boosting library for tree-based models.
  • LightGBM - Fast, distributed, high-performance gradient boosting framework.
  • CatBoost - High-performance gradient boosting on decision trees with categorical features support.
  • H2O-3 - Open-source distributed machine learning platform.
  • cuML - GPU-accelerated machine learning algorithms from RAPIDS.
  • dlib - Modern C++ toolkit containing machine learning algorithms and tools.
  • SHAP - Game theoretic approach to explain the output of any machine learning model.
  • InterpretML - Fit interpretable models and explain blackbox machine learning.
  • Optuna - Hyperparameter optimization framework.

Deep Learning

  • TensorFlow - End-to-end open source platform for machine learning and deep learning.
  • PyTorch - Deep learning framework with strong support for research and production.
  • PyTorch Lightning - PyTorch wrapper for high-performance AI research.
  • PyTorch Ignite - High-level library to help with training and evaluating neural networks.
  • Keras - High-level neural networks API, running on top of TensorFlow.
  • Fast.ai - Deep learning library simplifying training fast and accurate neural nets.
  • HuggingFace Transformers - Model-definition framework for state-of-the-art machine learning models.
  • HuggingFace Diffusers - Library for state-of-the-art pretrained diffusion models.
  • PEFT - Library for efficiently adapting large pretrained models.
  • YOLOv5 - Real-time object detection system.
  • Ultralytics - YOLOv8 and other computer vision models.
  • ONNX - Open standard for machine learning interoperability.
  • PyTorch Geometric - Geometric deep learning extension library for PyTorch.
  • Pyro - Deep universal probabilistic programming with Python and PyTorch.
  • Skorch - Scikit-learn compatible neural network library.
  • Sonnet - DeepMind's library for building complex neural networks.
  • JAX - Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more.

⬆ back to top


🚀 MLOps

Resources

Materials and curated lists for machine learning operations.

⬆ back to top


Tools

Platforms and utilities for deploying, monitoring, and maintaining ML systems.

  • ColossalAI - High-performance distributed training framework.
  • DVC - Version control system for machine learning projects.
  • Evidently - Tool for analyzing and monitoring data and model drift.
  • Deepchecks - Validation for ML models and data.
  • Sematic - Tool to build, debug, and execute ML pipelines with native Python.
  • netdata - Real-time performance monitoring.
  • meilisearch - Fast, open-source search engine.
  • vLLM - High-throughput and memory-efficient inference library for LLMs.
  • haystack - LLM framework for building search and question answering systems.
  • Kubeflow - Machine learning toolkit for Kubernetes.
  • Seldon Core - Open source platform for deploying and monitoring machine learning models in production.
  • Feast - A feature store for machine learning that manages and serves ML features to models.
  • BentoML - Framework for building, shipping, and scaling ML applications.
  • MLflow - Open-source platform for the complete machine learning lifecycle.
  • Wandb - Tool for experiment tracking, dataset versioning, and model management.
  • Comet ML - ML platform for tracking, comparing and optimizing experiments.
  • Netflix Metaflow - A human-friendly Python library for helping scientists and engineers build and manage real-life data science projects.
  • mindsdb - Platform for integrating AI into databases and applications.
  • KServe - Standardized serverless inference platform for deploying and serving machine learning models on Kubernetes.
  • SQLFlow - Brings machine learning capabilities to SQL, enabling model training and prediction using SQL syntax.
  • Jina AI Serve - Framework for building and deploying AI services that communicate via gRPC, HTTP and WebSockets.
  • LiteLLM - Unified interface to call all LLM APIs (OpenAI, Anthropic, Cohere, etc.) with consistent output formatting.

⬆ back to top


🧠 AI Applications & Platforms

Resources

A collection of resources focused on AI applications and platforms.

  • Awesome LLM Apps - Collection of awesome LLM apps with AI Agents and RAG using OpenAI, Anthropic, Gemini and opensource models.
  • Awesome Generative AI - A curated list of modern Generative Artificial Intelligence projects and services.
  • Generative AI for Beginners - Course on generative AI for beginners from Microsoft.
  • Awesome AI Agents - A curated list of AI autonomous agents, environments, and frameworks.
  • AI Collection - The Generative AI Landscape - A Collection of Awesome Generative AI Applications.
  • Awesome AI Apps - A collection of projects showcasing RAG, agents, workflows, and other AI use cases.
  • System Prompts and Models - System Prompts, Internal Tools & AI Models from various AI applications and coding tools.
  • Awesome LangChain - Awesome list of tools and projects with the awesome LangChain framework.
  • Awesome AI Tools - A curated list of Artificial Intelligence Top Tools.
  • Awesome LLM Security - A curation of awesome tools, documents and projects about LLM Security.

⬆ back to top


Tools

A collection of frameworks, platforms, and end-user applications for building and deploying AI-powered solutions.

AI Agents & Automation

  • n8n - Workflow automation platform for connecting APIs and services.
  • crewAI - Framework for orchestrating role-playing AI agents.
  • autogen - Framework for building multi-agent conversational systems.
  • AutoGPT - Autonomous AI agent that can complete complex tasks.
  • LangGraph - Framework for building stateful, multi-actor applications with LLMs, with cycles and control flow.

Development Frameworks & Tools

  • LangChain - Framework for developing applications powered by language models.
  • LlamaIndex - Data framework for LLM-based applications with RAG capabilities.
  • openai-python - Official Python library for OpenAI API.
  • openai-agents-python - Official OpenAI framework for building AI agents.
  • ragflow - Open-source RAG (Retrieval-Augmented Generation) workflow platform.
  • firecrawl - Web crawling and data extraction service for AI applications.
  • Fabric - Framework for augmenting humans using AI.

Code Generation & Assistance

  • gpt-engineer - AI-powered code generation tool.
  • gpt-pilot - AI pair programmer that writes entire applications.
  • tabby - Self-hosted AI coding assistant.

Model Deployment & Platforms

  • Ollama - Tool for running large language models locally.
  • OpenLLM - Open platform for operating large language models in production.
  • LocalAI - Self-hosted, local-first AI model deployment platform.
  • dify - Visual LLM application development platform.
  • LLaMA-Factory - Easy-to-use LLM fine-tuning framework.

End-User Applications

  • open-webui - Web interface for interacting with various LLMs.
  • ComfyUI - Visual node-based interface for Stable Diffusion.
  • lobe-chat - Modern AI conversation interface.
  • LibreChat - Open-source ChatGPT alternative.
  • quivr - Personal second brain and AI assistant.
  • upscayl - AI-powered image upscaling tool.
  • facefusion - AI face swapping and enhancement tool.
  • DocsGPT - Documentation-based question answering system.
  • Whisper - Robust speech recognition model for transcription and translation.

⬆ back to top


☁️ Cloud Platforms & Infrastructure

Resources

A collection of resources for mastering cloud-native technologies, containerization, and infrastructure management.

⬆ back to top


Tools

Tools for containerization, orchestration, infrastructure as code, and cloud-native development.

Containerization & Orchestration

  • Docker - Open platform for developing, shipping, and running applications in containers.
  • Docker Compose - A tool for defining and running multi-container Docker applications.
  • Kubernetes - Production-grade container orchestration system.
  • Kompose - Conversion tool from Docker Compose to Kubernetes.

Infrastructure as Code

  • Terraform - Infrastructure as Code tool.
  • OpenTofu - Open source fork of Terraform.
  • Pulumi - Modern IaC platform using familiar programming languages.
  • CDK8s - Define Kubernetes apps using familiar languages.

CI/CD & GitOps

  • Jenkins - Open source automation server.
  • Argo CD - Declarative GitOps continuous delivery.
  • Argo Workflows - Container-native workflow engine.
  • Tekton - Kubernetes-native CI/CD framework.
  • Spinnaker - Multi-cloud continuous delivery.
  • Dagger - Portable devkit for CI/CD pipelines.

Service Mesh & API Gateways

  • Traefik - Modern HTTP reverse proxy and load balancer.
  • Kong - Cloud-native API Gateway.
  • Apache APISIX - Dynamic API gateway.
  • Envoy Gateway - Manages Envoy Proxy as gateway.
  • Higress - Cloud-native API gateway based on Istio.
  • Meshery - Service mesh management.

Kubernetes Ecosystem

  • Helm - Package manager for Kubernetes.
  • Kustomize - Configuration customization for Kubernetes.
  • Kubernetes Dashboard - Web-based UI for Kubernetes.
  • Skaffold - Continuous development for Kubernetes.
  • Tilt - Local development for Kubernetes.
  • Flagger - Progressive delivery operator.
  • KubeVela - Application delivery platform.
  • KubeSphere - Kubernetes multi-cloud management.

Developer Platforms & Control Planes

⬆ back to top


⚡ Productivity

Resources

A collection of resources to enhance productivity.

  • Positron - A next-generation data science IDE.
  • Nanobrowser - An open-source AI web automation tool with multi-agent system that runs directly in your browser.
  • Best of Jupyter - Ranked list of notable Jupyter Notebook, Hub, and Lab projects.
  • Notion - An all-in-one workspace for note-taking and task management.
  • Trello - A visual project management tool.
  • ChatGPT Data Science Prompts - A collection of useful prompts for data scientists using ChatGPT.
  • Cookiecutter Data Science - A standardized project structure for data science projects.
  • The Markdown Guide - Comprehensive guide to learning Markdown.
  • Readme-AI - A tool to automatically generate README.md files for your projects.
  • Markdown Here - Extension for writing emails in Markdown and rendering them before sending.
  • Habitica - A habit-building and productivity app that treats your life like a role-playing game.
  • Microsoft To Do - A simple to-do list app from Microsoft.
  • Google Keep - A note-taking and list-making app.
  • Bujo - Tools to help transform the way you work and live.
  • Parabola - An AI-powered workflow builder for organizing data.
  • Asana - A project management platform for tracking work and projects.
  • Puter - An open-source, browser-based computing environment and cloud OS.

⬆ back to top


Useful Linux Tools

A selection of tools to enhance productivity and functionality in Linux environments.

  • tldr-pages - Simplified and community-driven man pages with practical examples.
  • Bat - Cat clone with syntax highlighting.
  • Exa - Modern replacement for ls (no longer maintained; eza is the community-maintained fork).
  • Ripgrep - Faster grep alternative.
  • Zoxide - Smarter cd command.
  • Peek - Simple animated GIF screen recorder with an easy-to-use interface.
  • CopyQ - Clipboard manager with advanced features.
  • Translate Shell - Command-line translator using Google Translate, Bing Translator, Yandex.Translate, etc.
  • Espanso - Cross-platform Text Expander written in Rust.
  • Flameshot - Powerful yet simple-to-use screenshot software.
  • DrawIO Desktop - An open-source diagramming software for making flowcharts, process diagrams, and more.
  • Inkscape - A powerful, free, and open-source vector graphics editor for creating and editing visualizations.
  • Rclone - A command-line program to manage files on cloud storage.
  • Rsync - A fast and versatile file copying tool that can synchronize files and directories between two locations over a network or locally.
  • Timeshift - System restore tool for Linux that creates filesystem snapshots using rsync+hardlinks or BTRFS snapshots.
  • Backintime - A comfortable and well-configurable graphical frontend for incremental backups.
  • Fzf - A command-line fuzzy finder.
  • Osquery - SQL powered operating system instrumentation, monitoring, and analytics.
  • GNU Parallel - A tool to run jobs in parallel.
  • HTop - An interactive process viewer.
  • Ncdu - A disk usage analyzer with an ncurses interface.
  • Thefuck - A command line tool to correct your previous console command.
  • Miller - A tool for querying, processing, and formatting data in various file formats (CSV, JSON, etc.), like awk/sed/cut for data.
  • jq - Command-line JSON processor for parsing and manipulating JSON data.
  • yq - Portable command-line YAML processor (like jq for YAML and XML).
  • q - Run SQL directly on CSV or TSV files from the command line.
  • VisiData - Interactive multitool for tabular data exploration in the terminal.
  • csvkit - Suite of command-line tools for working with CSV data.
  • httpie - Modern command-line HTTP client for API testing and debugging.
  • glances - Cross-platform system monitoring tool for resource usage analysis.
  • hyperfine - Command-line benchmarking tool for performance testing.
  • termgraph - Draw basic graphs in the terminal for quick data visualization.
  • fd - Simple, fast and user-friendly alternative to 'find'.
  • dust - More intuitive version of du, written in Rust.
  • bottom - Cross-platform graphical process/system monitor.

⬆ back to top


Useful VS Code Extensions

A collection of extensions to enhance functionality and productivity in Visual Studio Code.

⬆ back to top


📚 Skill Development & Career

Practice Resources

A collection of resources to enhance skills and advance your career in data analysis and related fields.

⬆ back to top


Curated Jupyter Notebooks

A selection of curated Jupyter notebooks to support learning and exploration in data science and analysis.

⬆ back to top


Data Sources & Datasets

A collection of resources for accessing datasets and data sources for analysis and projects.

  • Kaggle Datasets - Extensive collection of datasets for practice in data analysis.
  • Opendatasets - A Python library for downloading datasets from Kaggle, Google Drive, and other online sources.
  • Datasette - An open source multi-tool for exploring and publishing data.
  • Awesome Public Datasets - Curated list of high-quality open datasets.
  • Open Data Sources - Collection of various open data sources.
  • Free Datasets for Projects - Dataquest's compilation of free datasets.
  • Data World - Enterprise data catalog for data analysts, engineers, and governance teams.
  • Awesome Public Real Time Datasets - A list of publicly available datasets with real-time data.
  • Google Dataset Search - A search engine for datasets from across the web.
  • NASA Open Data Portal - A site for NASA's open data initiative, providing access to NASA's data resources.
  • The World Bank Data - Free and open access to global development data by The World Bank.
  • Voice Datasets - A collection of audio and speech datasets for voice AI and machine learning.
  • HuggingFace Datasets - A lightweight library to easily share and access datasets for audio, computer vision, and NLP.
  • TensorFlow Datasets - A collection of ready-to-use datasets for use with TensorFlow and other Python ML frameworks.
  • NLP Datasets - A curated list of datasets for natural language processing (NLP) tasks.
  • TorchVision Datasets - The torchvision.datasets module provides many built-in computer vision datasets.
  • LLM Datasets - A collection of datasets and resources for training and fine-tuning Large Language Models (LLMs).
  • Unsplash Datasets - A collection of datasets from Unsplash, useful for computer vision and research.
  • Awesome JSON Datasets - A curated list of awesome JSON datasets that are publicly available without authentication.

⬆ back to top


Resume and Interview Tips

A variety of resources to help you prepare for interviews and enhance your resume.

⬆ back to top


📋 Cheatsheets

A collection of cheatsheets across various domains to aid in quick reference and learning.

GoalKicker Programming Notes

⬆ back to top


Python

⬆ back to top


Data Science & Machine Learning

⬆ back to top


Linux & Git

⬆ back to top


Probability & Statistics

⬆ back to top


SQL & Databases

⬆ back to top


Miscellaneous

⬆ back to top


📦 Additional Python Libraries

A collection of supplementary Python libraries that enhance development workflow, automate processes, and maintain project quality beyond core data analysis tools.

Code Quality & Development

  • Black - Uncompromising Python code formatter.
  • Pre-commit - Framework for managing pre-commit hooks.
  • Pylint - Python code static analysis.
  • Mypy - Optional static typing for Python.
  • Rich - Rich text and beautiful formatting in the terminal.
  • Icecream - Debugging without using print.
  • Pandas-log - Logs pandas operations for data transformation tracking.
  • PandasVet - Code style validator for Pandas.
  • Pydeps - Python module dependency graphs.
  • PyForest - Automated Python imports for data science.

⬆ back to top


Documentation & File Processing

  • Sphinx - Documentation generator.
  • Pdoc - API documentation for Python projects.
  • Mkdocs - Project documentation with Markdown.
  • OpenPyXL - Read/write Excel files.
  • Tablib - Exports data to XLSX, JSON, CSV.
  • PyPDF2 - Reads and writes PDF files (deprecated; development continues under the name pypdf).
  • Python-docx - Reads and writes Word documents.
  • CleverCSV - Smart CSV reader for messy data.
  • Python-markdownify - Convert HTML to Markdown.
  • Xlwings - Integration of Python with Excel.
  • Xmltodict - Converts XML to Python dictionaries.
  • MarkItDown - Python tool for converting files and office documents to Markdown.
  • Jupyter-book - Build publication-quality books from Jupyter notebooks.
  • WeasyPrint - Convert HTML to PDF.
  • PyMuPDF - Advanced PDF manipulation library.
  • Camelot - PDF table extraction library.

⬆ back to top


Web & APIs

  • HTTPX - Next-generation HTTP client for Python.
  • FastAPI - Modern web framework for building APIs.
  • Typer - Library for building CLI applications.
  • Requests-cache - Persistent caching for requests library.

⬆ back to top


Miscellaneous

  • UV - An extremely fast Python package installer and resolver.
  • Funcy - Fancy functional tools for Python.
  • Pillow - Image processing library.
  • Ftfy - Fixes broken Unicode strings.
  • JMESPath - Queries JSON data (SQL-like for JSON).
  • Glom - Transforms nested data structures.
  • Diagrams - Diagrams as code for cloud architecture.
  • Pytest - Testing framework for writing small, readable tests that scales to complex functional testing.
  • Pampy - Pattern matching for Python dictionaries.
  • Pygorithm - A Python module for learning all major algorithms.
  • GitPython - A Python library used to interact with Git repositories.
  • TQDM - Progress bars for loops and operations.
  • Loguru - Python logging made simple.
  • Click - Beautiful command line interfaces.
  • Poetry - Python dependency management and packaging.
  • Hydra - Elegant configuration management.

⬆ back to top


📝 More Awesome Lists

A curated list of other awesome lists on various topics and technologies.

⬆ back to top


🌐 Additional Resources

A wide range of resources designed to facilitate learning, development, and exploration across different domains.

  • UC Berkeley - Data 8 - Course materials for the Data Science Foundations course.
  • A collective list of free APIs - A comprehensive list of free APIs for various purposes.
  • arXiv.org - A free distribution service and open-access archive for scholarly articles.
  • Elicit - An AI research assistant that helps automate parts of literature review.
  • 500+ AI/ML/DL/NLP Projects - A massive collection of AI and machine learning projects with code for learning and portfolios.
  • Kittl - Platform for creating and editing charts and data visualizations.
  • Zasper - High-performance IDE for Jupyter Notebooks.
  • Sketch - Design toolkit built around designers' workflows.
  • Growth.Design - A collection of product case studies and behavioral psychology insights for data-driven decision-making.

⬆ back to top


🤝 Contributing

We welcome your contributions!

See CONTRIBUTING.md for how to add resources.

⬆ back to top


📜 License

CC0

This work is dedicated to the public domain under the CC0 1.0 Universal license.

⬆ back to top