
Greetings! 👋

My name is Mykola Melnyk, and I'm an ML expert with two decades of experience in software development. I specialize in transforming complex business ideas into scalable, secure, and efficient AI-driven products. My expertise spans several areas, which lets me deliver AI solutions that drive business growth and improve efficiency.

Key Areas of My Specialization:

📄 Natural Language Processing (NLP), Computer Vision (CV), and Optical Character Recognition (OCR): 5+ years of experience in document processing, understanding, and anonymization. Led the development of Spark OCR (Visual NLP) using technologies such as Python/Scala, PySpark, PyTorch, LLMs, Llama 3, Mini Gemini, LangChain, and Hugging Face Transformers.

⚑ Big Data Processing with Apache Spark: 7+ years of experience designing and optimizing large-scale data pipelines for high-performance processing. In-depth knowledge of Spark internals, Spark Structured Streaming, and creator/contributor to the open-source spark-pdf datasource project written in Scala, enhancing Spark’s capabilities.

🔒 Data De-identification & Anonymization: Expert in anonymizing sensitive data in text, images, PDFs, and DICOM files. I use NLP, OCR, and computer vision to remove or mask personal information, supporting privacy, security, and compliance with GDPR and HIPAA.

🧬 Healthcare, Pharma, MedTech, BioTech Expertise: Over 5 years of experience in the healthcare and life sciences sectors, with a strong understanding of formats like DICOM, and expertise in delivering solutions specifically tailored to meet the unique needs of these industries.

TOP 5 Reasons to Work With Me

✅ End-to-End Expertise

✅ Complex Problem-Solving Ability

✅ Timely Delivery

✅ Transparent Communication

✅ Scalable Solutions

Professional Skills

🛠️ Programming Languages: Python, Scala

📊 Data Science & Machine Learning: NLP, Computer Vision, Large Language Models (LLMs), Optical Character Recognition (OCR), Model Productionalization, Deep Learning (PyTorch, TensorFlow, Hugging Face Transformers, ONNX, Pandas, CLIP)

💡 LLMs and Related Tools: OpenAI GPT, Gemini, Llama 3, FLUX, Together.ai, Ollama, Hugging Face, LangChain, LlamaIndex, LangServe, LangGraph, QLoRA, Streamlit, Gradio

⚑ Big Data & Distributed Systems: Big Data Processing, ETL, Stream Processing, Real-Time Aggregation, Apache Spark (PySpark, Spark ML, Spark Structured Streaming), Kinesis, Kafka, Databricks

🚀 Cloud Computing & Infrastructure: Amazon Web Services (AWS), Distributed Systems, CI/CD Pipelines, Docker, Jenkins, Graphite, Grafana, Elasticsearch, Kibana

⚙️ Databases: PostgreSQL, MongoDB, Redis, DynamoDB

💼 CRMs: HubSpot, Zoho CRM

Availability

Committed to long-term collaborations. Available full-time for your next project.

My Projects

Spark PDF DataSource



Source Code: https://github.com/StabRise/spark-pdf

Home page: https://stabrise.com/spark-pdf/

Quick Start Jupyter Notebook: PdfDataSource.ipynb


The project provides a custom data source for Apache Spark that lets you read PDF files into a Spark DataFrame.

Key features:

  • Read PDF documents into a Spark DataFrame
  • Lazy, page-by-page reading of PDF files
  • Support for large files, up to 10k pages
  • Support for scanned PDF files (via OCR)
  • No need to install Tesseract OCR separately; it is included in the package
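
A minimal sketch of what reading PDFs from PySpark might look like. It assumes the data source is registered under the short name `pdf` and accepts `resolution` and `pagePerPartition` options; the helper names `pdf_read_options` and `read_pdfs` are hypothetical and exist only for illustration. Check the project's README or the Quick Start notebook for the exact option names.

```python
from typing import Any, Dict


def pdf_read_options(resolution: int = 200, pages_per_partition: int = 2) -> Dict[str, str]:
    """Build reader options as strings.

    The option names here are assumptions, not a confirmed list
    of what the spark-pdf data source supports.
    """
    return {
        "resolution": str(resolution),
        "pagePerPartition": str(pages_per_partition),
    }


def read_pdfs(spark: Any, path: str, **kwargs: int) -> Any:
    """Read PDF files through the DataFrameReader of the given Spark session.

    `spark` is expected to be a pyspark.sql.SparkSession that already has
    the spark-pdf package on its classpath; "pdf" is the assumed format name.
    """
    reader = spark.read.format("pdf")
    for key, value in pdf_read_options(**kwargs).items():
        reader = reader.option(key, value)
    return reader.load(path)
```

With a real session this would be called as `df = read_pdfs(spark, "docs/*.pdf", resolution=300)`. Because the files are read lazily per page, nothing is parsed until an action such as `df.show()` runs.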

ScaleDP



Source Code: https://github.com/StabRise/scaledp

Home page: https://stabrise.com/scaledp/

Quick Start Jupyter Notebook: https://github.com/StabRise/ScaleDP-Tutorials/blob/master/1.QuickStart.ipynb


ScaleDP is an open-source library for processing documents with Apache Spark.

Key features:

  • Load PDF documents and images
  • Extract text from PDF documents and images
  • Extract images from PDF documents
  • Run OCR on images and PDF documents
  • Run NER on text extracted from PDF documents and images
  • Visualize NER results
