Drinking coffee

Sponsoring

@typst

Organizations

@commoncrawl @bigscience-workshop @oscar-project

pjox/README.md

Hi there 👋

I'm a Principal Research Scientist at the Common Crawl Foundation.

I am interested in large corpora for training language models, especially for under-resourced and historical languages. I also work on tasks such as Named Entity Recognition (NER), dependency parsing, part-of-speech tagging, machine translation and document structuring.

I love coffee ☕️, cookies 🍪 and maths.

Pinned

  1. commoncrawl/cc-downloader Public

    A polite and user-friendly downloader for Common Crawl data

    Rust 64 4

  2. oscar-utils Public

    A new set of utilities to work with the OSCAR Corpus

    Rust 2

  3. oscar2parquet Public

    Converts OSCAR's JSONL files into Parquet

    Rust 2

  4. oscar-project/ungoliant Public

    🕷️ The pipeline for the OSCAR corpus

    Rust 174 17

  5. oscar-project/goclassy Public archive

    An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline.

    Go 86 6

  6. oscar-project/oscar-website Public

    The website of the OSCAR Project

    TeX 11 13