oyi77/OpenMedallion-Dataset


language: en
license: mit
task_categories: text-generation, tabular-regression, time-series-forecasting
tags: finance, trading, quantitative, crypto, stocks, forex, macroeconomic, defi, options, machine-learning, time-series, arbitrage, sentiment
pretty_name: OpenMedallion Financial Dataset
size_categories: 1M<n<10M

OpenMedallion Financial Dataset

Comprehensive financial dataset for quantitative research and machine learning — 4.6M+ rows across 54 parquet files covering 22 domains including equities, forex, crypto, macroeconomics, DeFi, options, sentiment, and AI training data.


Dataset Summary

| Metric | Value |
|---|---|
| Total Files | 54 parquet files |
| Total Rows | ~4,610,000 |
| Total Size | ~370 MB |
| Domains | 22 financial categories |
| License | MIT |
| Format | Apache Parquet (columnar) |

File Inventory

Macro & Economics

| File | Rows | Columns | Size | Description |
|---|---|---|---|---|
| fred_mega.parquet | 426,432 | 9 | 2.0 MB | 133 FRED economic indicators (GDP, CPI, unemployment, rates), 1920s–present |
| fred_extended.parquet | 6,013 | 4 | 0.04 MB | Extended FRED series with additional indicators |
| macro_indicators_extended.parquet | 69,473 | 4 | 0.5 MB | Global macro indicators (PMI, industrial production, trade) |
| treasury_yields_full.parquet | 197,340 | 5 | 0.7 MB | US Treasury yield curve across all maturities |
| vix_volatility.parquet | 49,224 | 8 | 1.0 MB | VIX index, volatility surface, term structure |
| google_trends_finance.parquet | 8,070 | 4 | 0.02 MB | Google Trends search interest for financial terms |

Equities

| File | Rows | Columns | Size | Description |
|---|---|---|---|---|
| stocks_sp500_nasdaq_25yr.parquet | 994,763 | 11 | 26.8 MB | S&P 500 and NASDAQ daily OHLCV, 25 years |
| stocks_global_25yr.parquet | 366,597 | 10 | 11.0 MB | Global equities across 50+ exchanges, 25 years |
| stocks_extended_1.parquet | 100,000 | 12 | 2.4 MB | Extended stock data: fundamentals + technicals (part 1) |
| stocks_extended_2.parquet | 100,000 | 12 | 2.6 MB | Extended stock data: fundamentals + technicals (part 2) |
| stocks_extended_3.parquet | 94,304 | 12 | 3.2 MB | Extended stock data: fundamentals + technicals (part 3) |
| financedb_equities.parquet | 160,126 | 21 | 31.1 MB | FinanceDatabase equities: 160K securities with metadata |
| financedb_funds.parquet | 57,853 | 9 | 5.1 MB | Mutual funds, ETFs, and index funds |
| financedb_indices.parquet | 91,181 | 8 | 3.0 MB | Global market indices |
| financedb_moneymarkets.parquet | 1,367 | 7 | 0.05 MB | Money market instruments |
| dividends_full.parquet | 9,483 | 4 | 0.06 MB | Historical dividend payments |
| analyst_recommendations.parquet | 268 | 8 | 0.01 MB | Wall Street analyst buy/sell/hold ratings |
| insider_transactions.parquet | 4,404 | 11 | 0.09 MB | SEC Form 4 insider trading filings |
| sec_edgar_filings.parquet | 388 | 6 | 0.01 MB | SEC EDGAR filing metadata |

Forex

| File | Rows | Columns | Size | Description |
|---|---|---|---|---|
| forex_expanded.parquet | 238,675 | 8 | 6.0 MB | 60+ currency pairs, daily OHLCV |
| financedb_currencies.parquet | 2,556 | 7 | 0.14 MB | Currency pairs with metadata from FinanceDatabase |

Crypto & Exchange Data

| File | Rows | Columns | Size | Description |
|---|---|---|---|---|
| crypto_expanded.parquet | 105,374 | 9 | 4.6 MB | Top 50 crypto assets, daily OHLCV |
| binance_klines_1d.parquet | 10,000 | 13 | 0.7 MB | Binance 1-day candlestick data |
| hf_5min.parquet | 1,000 | 10 | 0.05 MB | BTC/ETH 5-minute OHLCV bars |
| financedb_crypto.parquet | 3,367 | 7 | 0.1 MB | Crypto assets with metadata from FinanceDatabase |
| open_interest.parquet | 310 | 5 | 0.01 MB | Futures open interest across exchanges |
| long_short_ratio.parquet | 180 | 7 | 0.01 MB | Top trader long/short positioning |

DeFi & On-Chain

| File | Rows | Columns | Size | Description |
|---|---|---|---|---|
| defillama_pools.parquet | 5,000 | 14 | 0.35 MB | DeFi protocol pools from DefiLlama (TVL, APY, chain) |
| onchain_metrics.parquet | 14,137 | 4 | 0.13 MB | BTC/ETH on-chain metrics (hash rate, active addresses, fees) |
| dex_volumes.parquet | 500 | 13 | 0.04 MB | Decentralized exchange trading volumes |

Arbitrage

| File | Rows | Columns | Size | Description |
|---|---|---|---|---|
| cross_cex_arbitrage.parquet | 84 | 13 | 0.01 MB | Cross-exchange arbitrage opportunities (CEX-CEX) |
| dex_dex_arbitrage.parquet | 408 | 11 | 0.03 MB | DEX-DEX arbitrage spread data |

Commodities

| File | Rows | Columns | Size | Description |
|---|---|---|---|---|
| commodities_expanded.parquet | 124,370 | 9 | 2.4 MB | Gold, silver, oil, natural gas, agricultural commodities |

Options & Volatility

| File | Rows | Columns | Size | Description |
|---|---|---|---|---|
| options_data.parquet | 1,648 | 14 | 0.05 MB | Deribit BTC/ETH options chain with greeks |

Taiwan & International Markets

| File | Rows | Columns | Size | Description |
|---|---|---|---|---|
| finmind_tw_stocks.parquet | 194,210 | 10 | 4.5 MB | Taiwan Stock Exchange daily data from FinMind |
| finmind_tw_futures.parquet | 50,289 | 9 | 0.6 MB | Taiwan futures market data |
| finmind_us_stocks.parquet | 118,005 | 9 | 0.12 MB | US stock historical data via FinMind |

Sentiment & News

| File | Rows | Columns | Size | Description |
|---|---|---|---|---|
| news_sentiment.parquet | 100 | 8 | 0.01 MB | Financial news with sentiment scores (Twitter/X, Reddit) |
| shipping_sentiment.parquet | 9,807 | 8 | 0.33 MB | Global shipping and trade sentiment indicators |
| fingpt_ner.parquet | 511 | 4 | 0.07 MB | Financial named entity recognition dataset |

Yahoo Finance

| File | Rows | Columns | Size | Description |
|---|---|---|---|---|
| yahoo_indices_etfs.parquet | 280,348 | 9 | 8.3 MB | Major indices and ETFs from Yahoo Finance |

AI/ML Training Data

| File | Rows | Columns | Size | Description |
|---|---|---|---|---|
| finance_instruct_full_1.parquet | 100,000 | 4 | 103.4 MB | Financial instruction-following dataset (part 1) |
| finance_instruct_full_2.parquet | 100,000 | 4 | 20.4 MB | Financial instruction-following dataset (part 2) |
| finance_instruct_full_3.parquet | 100,000 | 4 | 22.6 MB | Financial instruction-following dataset (part 3) |
| finance_instruct_full_4.parquet | 100,000 | 4 | 36.2 MB | Financial instruction-following dataset (part 4) |
| finance_instruct_full_5.parquet | 100,000 | 4 | 39.2 MB | Financial instruction-following dataset (part 5) |
| finance_instruct_full_6.parquet | 18,185 | 4 | 8.2 MB | Financial instruction-following dataset (part 6) |
| finance_instruct_500k.parquet | 50,001 | 4 | 42.6 MB | 500K finance instruction samples (consolidated) |
| finance_alpaca_full_1.parquet | 68,912 | 5 | 21.3 MB | Finance Alpaca-format instruction tuning data |
| financial_qa_10k.parquet | 7,000 | 6 | 1.5 MB | 10K financial Q&A pairs |
| hf_financebench_qa.parquet | 150 | 15 | 0.29 MB | FinanceBench benchmark Q&A |
| hf_ktx_finance.parquet | 1,124 | 2 | 0.05 MB | Finance knowledge triple extraction |
| hf_microstructure.parquet | 3,000 | 12 | 0.18 MB | Market microstructure training data |
| hf_trading_reasoning.parquet | 1,139 | 10 | 3.3 MB | Chain-of-thought trading reasoning samples |
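
Several entries in the inventory are split into numbered parts (`stocks_extended_1` to `_3`, `finance_instruct_full_1` to `_6`). The sketch below shows one way to stitch such a family back together. It assumes the parts share a schema (the inventory lists the same column count for each part, but this is not confirmed) and that the files live under the `data/` prefix used in the Quick Start URLs. `part_urls` and `load_parts` are hypothetical helper names, not part of the dataset.

```python
# Path prefix copied from the Quick Start examples; treat it as an assumption.
BASE_URL = "hf://datasets/paijo77/OpenMedallion/data"


def part_urls(stem: str, n_parts: int, base_url: str = BASE_URL) -> list[str]:
    """Build URLs for files named <stem>_1.parquet ... <stem>_<n_parts>.parquet."""
    return [f"{base_url}/{stem}_{i}.parquet" for i in range(1, n_parts + 1)]


def load_parts(stem: str, n_parts: int):
    """Read every part and stack the rows into one DataFrame (hypothetical helper)."""
    import pandas as pd  # imported lazily so part_urls works without pandas installed

    frames = [pd.read_parquet(url) for url in part_urls(stem, n_parts)]
    # ignore_index=True renumbers rows 0..N-1 across all parts.
    return pd.concat(frames, ignore_index=True)
```

For example, `load_parts("finance_instruct_full", 6)` would download all six parts (roughly 230 MB according to the inventory), so it may be worth starting with `load_parts("stocks_extended", 3)` instead.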

Quick Start

Python (pandas)

import pandas as pd

# Load individual files directly from the Hugging Face Hub.
# The hf:// scheme requires the huggingface_hub package to be installed.
fred = pd.read_parquet("hf://datasets/paijo77/OpenMedallion/data/fred_mega.parquet")
stocks = pd.read_parquet("hf://datasets/paijo77/OpenMedallion/data/stocks_sp500_nasdaq_25yr.parquet")
crypto = pd.read_parquet("hf://datasets/paijo77/OpenMedallion/data/crypto_expanded.parquet")
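
The inventory lists column counts but not column names, so it can help to look at a file's schema before loading it in full. The following is a minimal sketch, assuming `fsspec`, `pyarrow`, and `huggingface_hub` are installed and that the `hf://` paths above are correct. `hf_url` and `describe_parquet` are illustrative names introduced here.

```python
def hf_url(filename: str, repo: str = "paijo77/OpenMedallion") -> str:
    """Build the hf:// URL for a file under the repo's data/ directory (assumed layout)."""
    return f"hf://datasets/{repo}/data/{filename}"


def describe_parquet(url: str) -> dict:
    """Return the row count and column names from the Parquet footer."""
    import fsspec  # lazy imports keep hf_url usable without these packages
    import pyarrow.parquet as pq

    with fsspec.open(url, "rb") as f:
        parquet_file = pq.ParquetFile(f)
        return {
            "rows": parquet_file.metadata.num_rows,
            "columns": parquet_file.schema_arrow.names,
        }
```

Calling `describe_parquet(hf_url("fred_mega.parquet"))` reads the file footer rather than the row data, and the result can be compared with the row and column counts in the inventory.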

Python (polars)

import polars as pl

df = pl.read_parquet("hf://datasets/paijo77/OpenMedallion/data/hf_5min.parquet")
print(df.select(["symbol", "close"]).filter(pl.col("symbol") == "BTCUSDT").head())
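
For the larger files, a lazy scan may be preferable to `read_parquet`, because polars can then push the filter and column selection into the Parquet reader. This is a sketch only: it reuses the `symbol` and `close` column names from the example above, and `closes_for` is a hypothetical helper name.

```python
# URL copied from the polars example above.
HF_5MIN_URL = "hf://datasets/paijo77/OpenMedallion/data/hf_5min.parquet"


def closes_for(symbol: str, source: str = HF_5MIN_URL):
    """Lazily scan a Parquet file and return symbol/close rows for one symbol."""
    import polars as pl  # lazy import so the module loads without polars

    return (
        pl.scan_parquet(source)               # builds a query plan, reads nothing yet
        .filter(pl.col("symbol") == symbol)   # row filter
        .select(["symbol", "close"])          # only these columns are needed
        .collect()                            # executes the plan
    )
```

`closes_for("BTCUSDT").head()` should return the same rows as the eager example, provided the column names match the actual file.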

Hugging Face datasets library

from datasets import load_dataset

ds = load_dataset("paijo77/OpenMedallion")
print(ds)
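
Because the repository mixes files with very different schemas, loading a single file through the `data_files` argument may be more practical than loading everything at once. A minimal sketch, assuming the files sit under a `data/` directory as the `hf://` URLs in this section suggest. `data_file` and `load_one` are hypothetical helper names.

```python
def data_file(filename: str) -> str:
    """Path of a file relative to the repo root (data/ prefix is an assumption)."""
    return f"data/{filename}"


def load_one(filename: str, repo: str = "paijo77/OpenMedallion"):
    """Load one Parquet file from the repo as a datasets.Dataset."""
    from datasets import load_dataset  # lazy import; requires the datasets package

    # When data_files is a plain string or list, the rows land in the "train" split.
    return load_dataset(repo, data_files=data_file(filename), split="train")
```

For instance, `load_one("financial_qa_10k.parquet")` would return only the Q&A pairs, leaving the market data files untouched.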

Data Sources

| Source | URL | Coverage |
|---|---|---|
| FRED | https://fred.stlouisfed.org | 100+ years of US macro data |
| FinMind | https://finmindtrade.com | Taiwan/US stocks, futures |
| FinanceDatabase | https://github.com/JerBouma/FinanceDatabase | 300K+ financial instruments |
| Binance | https://binance-docs.github.io/apidocs | Crypto candlestick data |
| Deribit | https://docs.deribit.com | Options chain, funding rates |
| DefiLlama | https://defillama.com | DeFi protocol TVL and pools |
| Yahoo Finance | https://finance.yahoo.com | Indices, ETFs, equities |
| SEC EDGAR | https://www.sec.gov/edgar | Corporate filings |
| Etherscan | https://etherscan.io/apis | Ethereum on-chain data |
| Google Trends | https://trends.google.com | Search interest trends |

Citation

@dataset{openmedallion2026,
  title     = {OpenMedallion Financial Dataset},
  author    = {BerkahKarya},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/paijo77/OpenMedallion},
  license   = {MIT}
}

License

This dataset is released under the MIT License.


Disclaimer

This is financial data for research and educational purposes. It is not investment advice. Always backtest strategies thoroughly before deploying with real capital. Past performance does not guarantee future results.

About

🧠 The Mad Researcher's Financial Intelligence Dataset: 54 files, 22 domains, 4.6M+ rows, 99 years of data (1927-2026)
