Performance Optimization Techniques

Explore top LinkedIn content from expert professionals.

  • View profile for Zach Wilson
Zach Wilson is an Influencer

    Founder @ DataExpert.io | ex-Netflix ex-Meta staff engineer | Angel Investor in 6 startups | Featured on Forbes | Dogs

    514,881 followers

Apache Spark has levels to it:

- Level 0: You can run spark-shell or pyspark; it means you can start.
- Level 1: You understand the Spark execution model:
  • RDDs vs DataFrames vs Datasets
  • Transformations (map, filter, groupBy, join) vs Actions (collect, count, show)
  • Lazy execution and the DAG (Directed Acyclic Graph)
  Master these concepts, and you'll have a solid foundation.
- Level 2: Optimizing Spark queries
  • Understand the Catalyst Optimizer and how it rewrites queries for efficiency.
  • Master columnar storage and Parquet vs JSON vs CSV.
  • Use broadcast joins to avoid shuffle nightmares.
  • Shuffle operations are expensive; reduce them with partitioning and good data modeling.
  • Coalesce vs repartition: know when to use each.
  • Avoid UDFs unless absolutely necessary (they bypass Catalyst optimization).
- Level 3: Tuning for performance at scale
  • Master spark.sql.autoBroadcastJoinThreshold.
  • Understand how task parallelism works and set spark.sql.shuffle.partitions properly.
  • Skewed data? Use adaptive query execution.
  • Use EXPLAIN and queryExecution.debug to analyze execution plans.
- Level 4: Deep dive into cluster resource management
  • Spark on YARN vs Kubernetes vs Standalone: know the tradeoffs.
  • Understand executor vs driver memory; tune spark.executor.memory and spark.driver.memory.
  • Dynamic allocation (spark.dynamicAllocation.enabled=true) can save costs.
  • Know when to use RDDs over DataFrames (spoiler: almost never).

What else did I miss for mastering Spark and distributed compute?
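The Level 1 distinction between lazy transformations and eager actions can be sketched with a toy model in plain Python. This is an illustration of the idea only, not Spark's actual API; `LazyData` and its methods are made up for the sketch:

```python
# Toy illustration of Spark's lazy-execution model (not Spark's real API):
# transformations only record steps in a plan; an action runs the whole chain.
class LazyData:
    def __init__(self, data, steps=None):
        self.data = data
        self.steps = steps or []          # the recorded "DAG" (a simple chain here)

    def map(self, fn):                    # transformation: lazy, returns a new plan
        return LazyData(self.data, self.steps + [("map", fn)])

    def filter(self, pred):               # transformation: lazy
        return LazyData(self.data, self.steps + [("filter", pred)])

    def collect(self):                    # action: only now does anything execute
        out = self.data
        for kind, fn in self.steps:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

rdd = LazyData(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has run yet; only collect() evaluates the recorded chain.
print(rdd.collect())  # [0, 4, 16]
```

In real Spark the recorded plan is a DAG that Catalyst can reorder and optimize before execution, which is exactly why lazy evaluation matters for Level 2.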

  • View profile for Andrew Ng
Andrew Ng is an Influencer

    DeepLearning.AI, AI Fund and AI Aspire

    2,404,680 followers

Last week, I described four design patterns for AI agentic workflows that I believe will drive significant progress: Reflection, Tool Use, Planning, and Multi-agent collaboration. Instead of having an LLM generate its final output directly, an agentic workflow prompts the LLM multiple times, giving it opportunities to build step by step to higher-quality output.

Here, I'd like to discuss Reflection. It's relatively quick to implement, and I've seen it lead to surprising performance gains. You may have had the experience of prompting ChatGPT/Claude/Gemini, receiving unsatisfactory output, delivering critical feedback to help the LLM improve its response, and then getting a better response. What if you automate the step of delivering critical feedback, so the model automatically criticizes its own output and improves its response? This is the crux of Reflection.

Take the task of asking an LLM to write code. We can prompt it to generate the desired code directly to carry out some task X. Then, we can prompt it to reflect on its own output, perhaps as follows:

    Here's code intended for task X: [previously generated code]
    Check the code carefully for correctness, style, and efficiency, and give constructive criticism for how to improve it.

Sometimes this causes the LLM to spot problems and come up with constructive suggestions. Next, we can prompt the LLM with context including (i) the previously generated code and (ii) the constructive feedback, and ask it to use the feedback to rewrite the code. This can lead to a better response. Repeating the criticism/rewrite process might yield further improvements. This self-reflection process allows the LLM to spot gaps and improve its output on a variety of tasks including producing code, writing text, and answering questions.

And we can go beyond self-reflection by giving the LLM tools that help evaluate its output; for example, running its code through a few unit tests to check whether it generates correct results on test cases, or searching the web to double-check text output. Then it can reflect on any errors it found and come up with ideas for improvement.

Further, we can implement Reflection using a multi-agent framework. I've found it convenient to create two agents, one prompted to generate good outputs and the other prompted to give constructive criticism of the first agent's output. The resulting discussion between the two agents leads to improved responses.

Reflection is a relatively basic type of agentic workflow, but I've been delighted by how much it improved my applications' results. If you're interested in learning more about reflection, I recommend:
- Self-Refine: Iterative Refinement with Self-Feedback, by Madaan et al. (2023)
- Reflexion: Language Agents with Verbal Reinforcement Learning, by Shinn et al. (2023)
- CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing, by Gou et al. (2024)

[Original text: https://lnkd.in/g4bTuWtU ]
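The generate/critique/rewrite loop can be sketched as follows. `call_llm` is a stand-in for any real LLM API and is stubbed here so the control flow is runnable; the prompts mirror the ones above:

```python
# Sketch of the Reflection loop: generate, critique, then rewrite with the
# critique in context. call_llm is a placeholder, not a real LLM client.
def call_llm(prompt):
    # Stub: a real implementation would call an LLM API here.
    if "constructive criticism" in prompt:
        return "Add input validation and a docstring."
    return "def add(a, b):\n    return a + b"

def reflect_and_revise(task, rounds=2):
    draft = call_llm(f"Write code for task: {task}")
    for _ in range(rounds):
        critique = call_llm(
            f"Here's code intended for task {task}:\n{draft}\n"
            "Check the code carefully for correctness, style, and efficiency, "
            "and give constructive criticism for how to improve it."
        )
        draft = call_llm(
            f"Rewrite the code for task {task} using this feedback:\n"
            f"{critique}\nPrevious code:\n{draft}"
        )
    return draft

print(reflect_and_revise("add two numbers"))
```

Swapping the stub for a real API call, and optionally running unit tests between rounds to ground the critique, turns this skeleton into the tool-assisted variant described above.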

  • View profile for Ethan Evans
Ethan Evans is an Influencer

    Former Amazon VP, sharing High Performance and Career Growth insights. Outperform, out-compete, and still get time off for yourself.

    165,595 followers

I've recently suffered a major career setback. Since I teach about high performance and career growth, I want to share how I am addressing it. One day you will need this recipe yourself!

My goal in my current "career" is to reach as many people as I can, and to help them achieve career success and satisfaction. For the last three years, the way to do this has been through LinkedIn. Unfortunately, LinkedIn recently made some unknown changes to their algorithm. Other Top Voices and I have noticed a drop of 70% to 80% in the reach of our posts. Since my goal is to share my knowledge with more people, that means my goal just took an 80% hit.

In general, setbacks in performance are either due to:
A) Something we did, or
B) Something external, outside our direct control

Mistakes, poor decisions, and missed deadlines are examples of A. They are in our control. Things like Covid, high interest rates, and reorganizations at work are examples of B, outside our control. LinkedIn's change is also case B, outside my control.

When a setback comes from something in your control, you know clearly what you did wrong and what you need to change to restore your performance and progress. Fixing your own issues may take time and be difficult, but you know what to do. When the setback is due to something outside your control, you do not know how to fix the issue. So, how can we react when our performance is shattered and we do not know why? Here is my recipe:

1. Allow yourself a fixed amount of time to grieve (and complain if you wish). Emotions are real, and before you can move on you will need to sit with those emotions. But do not get stuck in them. Curse your bad luck, pout for a minute, etc. Then, move to the next step.

2. Refocus on your core value. Whatever happened, go back to how you define high performance to ensure it is still relevant. I admit, I slipped into defining my own performance by how many people viewed my LinkedIn posts. This was a mistake. My mission is to help others, so getting views is a proxy, not a result. And using LinkedIn is just a method for the mission, not the mission itself.

3. Adapt your core value if you must (if its value has decreased). In my case, the value of what I offer hasn't changed; the external delivery system has.

4. Once you adapt and/or increase your value, find new ways to deliver it if necessary. Luckily, I have other options for reaching people: my Substack newsletter, YouTube, etc. Since Substack has been such a good partner recently, I will start there. I have also refocused how I write on LinkedIn to make every post focused on my goal.

5. Test, measure, adapt, repeat! Really, this step is everything. Once you get past the grief, jump into action in this loop. Nothing can stop you if you keep working to refine, deliver, and showcase your core value.

Comments? Here's my newsletter, which is my next area of investment: https://lnkd.in/gXh2pdK2

  • View profile for Brij kishore Pandey
Brij kishore Pandey is an Influencer

    AI Architect | AI Engineer | Generative AI | Agentic AI

    708,477 followers

Many of us write SQL queries daily, but how often do we consider the underlying execution order? Understanding each step can be a game-changer for optimizing query performance and getting accurate results. Here's a detailed walkthrough of SQL's execution flow:

𝟭. 𝗙𝗥𝗢𝗠 𝗖𝗹𝗮𝘂𝘀𝗲: 𝗧𝗵𝗲 𝗦𝘁𝗮𝗿𝘁𝗶𝗻𝗴 𝗟𝗶𝗻𝗲
   - Role: Establishes the data sources (tables, views, or joins) your query will work with.
   - Why It Matters: The FROM clause is where it all begins. Selecting the right sources and structuring joins here determines the query's foundation and efficiency.

𝟮. 𝗪𝗛𝗘𝗥𝗘 𝗖𝗹𝗮𝘂𝘀𝗲: 𝗧𝗵𝗲 𝗙𝗶𝗹𝘁𝗲𝗿 𝗚𝗮𝘁𝗲
   - Role: Applies conditions to remove rows that don't meet specified criteria.
   - Why It Matters: Filtering data at this stage reduces the load for subsequent steps, saving processing time and ensuring only relevant data proceeds.

𝟯. 𝗚𝗥𝗢𝗨𝗣 𝗕𝗬 & 𝗔𝗴𝗴𝗿𝗲𝗴𝗮𝘁𝗶𝗼𝗻 (𝗢𝗽𝘁𝗶𝗼𝗻𝗮𝗹): 𝗖𝗮𝘁𝗲𝗴𝗼𝗿𝗶𝘇𝗶𝗻𝗴 𝗗𝗮𝘁𝗮
   - GROUP BY: Clusters rows by specified columns, transforming raw data into grouped sets.
   - Aggregate Functions (e.g., SUM, COUNT): Summarize each group's data, converting details into insights.
   - HAVING Clause: Filters these groups based on aggregate results.
   - Why It Matters: Using GROUP BY and aggregation effectively is essential for summary reports. This step is powerful for analytics but can be resource-intensive if misused.

𝟰. 𝗦𝗘𝗟𝗘𝗖𝗧 𝗖𝗹𝗮𝘂𝘀𝗲: 𝗖𝗵𝗼𝗼𝘀𝗶𝗻𝗴 𝘁𝗵𝗲 𝗥𝗲𝘀𝘂𝗹𝘁𝘀
   - Role: Specifies which columns or expressions appear in the final output.
   - Did You Know? The SELECT clause runs after WHERE and GROUP BY, meaning you're selecting columns from an already-filtered and grouped dataset.
   - Why It Matters: This ensures that only the necessary columns make it to the final result, making the query efficient and clear.

𝟱. 𝗢𝗥𝗗𝗘𝗥 𝗕𝗬 & 𝗟𝗜𝗠𝗜𝗧: 𝗥𝗲𝗳𝗶𝗻𝗶𝗻𝗴 𝘁𝗵𝗲 𝗢𝘂𝘁𝗽𝘂𝘁
   - ORDER BY: Sorts the results based on one or more columns, ideal for ordered reports and prioritized data.
   - LIMIT: Caps the number of returned rows, especially useful for large datasets.
   - Why It Matters: Ordering and limiting focus the output for user readability and system efficiency, especially when dealing with large datasets.

Why Execution Order is Essential

Knowing SQL's execution sequence helps you:
- 𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗲 𝗣𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲: Each step can be streamlined to make queries faster and more responsive.
- 𝗧𝗿𝗼𝘂𝗯𝗹𝗲𝘀𝗵𝗼𝗼𝘁 𝗜𝘀𝘀𝘂𝗲𝘀 𝗘𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝘁𝗹𝘆: By understanding the order, you can pinpoint issues at specific steps.
- 𝗥𝗲𝗱𝘂𝗰𝗲 𝗥𝗲𝘀𝗼𝘂𝗿𝗰𝗲 𝗨𝘀𝗮𝗴𝗲: Targeted optimization in each clause saves both time and computational power.

𝗣𝗿𝗼 𝗧𝗶𝗽: Different SQL dialects (MySQL, SQL Server, Oracle) can vary in execution quirks, so always refer to your database documentation for precise optimization techniques.

What's your top SQL tip for query performance? 👇
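The logical order above can be seen in action with Python's built-in sqlite3 module. The `orders` table and its rows are invented for illustration:

```python
# Demonstrating SQL's logical execution order with sqlite3 (stdlib).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount INTEGER, status TEXT);
    INSERT INTO orders VALUES
        ('alice', 100, 'paid'), ('alice', 50, 'paid'),
        ('bob',   200, 'paid'), ('bob',  999, 'refunded'),
        ('carol', 10,  'paid');
""")

rows = conn.execute("""
    SELECT customer, SUM(amount) AS total     -- 4. SELECT (names the alias)
    FROM orders                               -- 1. FROM
    WHERE status = 'paid'                     -- 2. WHERE (runs before SELECT)
    GROUP BY customer                         -- 3. GROUP BY
    HAVING SUM(amount) > 20                   -- 3b. HAVING filters groups
    ORDER BY total DESC                       -- 5. ORDER BY (alias now exists)
    LIMIT 2                                   -- 5b. LIMIT
""").fetchall()
print(rows)  # [('bob', 200), ('alice', 150)]
```

Note that ORDER BY may reference the alias `total` because it runs after SELECT, while WHERE cannot: filtering happens before the alias exists.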

  • View profile for Sebastian Raschka, PhD
Sebastian Raschka, PhD is an Influencer

    ML/AI research engineer. Author of Build a Large Language Model From Scratch (amzn.to/4fqvn0D) and Ahead of AI (magazine.sebastianraschka.com), on how LLMs work and the latest developments in the field.

    221,993 followers

My next tutorial on pretraining an LLM from scratch is now out. It starts with a step-by-step walkthrough of understanding, calculating, and optimizing the loss. After training, we update the text generation function with temperature scaling and top-k sampling. And finally, we also load openly available pretrained weights into our scratch-built model architecture.

Along with this pretraining tutorial, I also have bonus material on speeding up LLM training. These tips apply not just to LLMs but also to other transformer-based models like vision transformers:

1. Instead of saving the causal mask, create it on the fly to reduce memory usage (here it has minimal effect, but it can add up in long-context models like Llama 3.2 with 131k-input-token support)
2. Use tensor cores (only works for Ampere GPUs like the A100 and newer)
3. Use the fused CUDA kernels for `AdamW` by setting `fused=True`
4. Pre-allocate and re-use GPU memory via the pinned-memory setting in the data loader
5. Switch from 32-bit float to 16-bit brain float (bfloat16) precision
6. Replace from-scratch implementations of attention mechanisms, layer normalizations, and activation functions with PyTorch counterparts that have optimized CUDA kernels
7. Use FlashAttention for more efficient memory read and write operations
8. Compile the model
9. Optimize the vocabulary size
10. After saving memory with the steps above, increase the batch size

Video tutorial: https://lnkd.in/gDRycWea
PyTorch speed-ups: https://lnkd.in/gChvGCJH
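A few of these speed-ups can be sketched in PyTorch. This assumes a recent PyTorch version; the tiny linear model is a stand-in for the tutorial's actual LLM, and `causal_mask` is an illustrative helper, not the tutorial's code:

```python
# Sketch of several of the speed-ups above (illustrative, CPU-friendly demo).
import torch

model = torch.nn.Linear(16, 16)

# 2. Allow TF32 tensor cores for float32 matmuls on Ampere and newer GPUs.
torch.set_float32_matmul_precision("high")

# 3. Fused CUDA kernel for AdamW; fused=True requires CUDA parameters,
#    so it is enabled here only when a GPU is present.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              fused=torch.cuda.is_available())

# 1. Create the causal mask on the fly instead of storing it as a buffer.
def causal_mask(seq_len, device=None):
    return torch.triu(torch.ones(seq_len, seq_len, device=device),
                      diagonal=1).bool()

# 5. Run the forward pass in bfloat16 via autocast (CPU used for this demo).
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(torch.randn(4, 16))

# 8. torch.compile can fuse kernels further (left commented: slow first call).
# model = torch.compile(model)
```

The mask helper returns True where attention should be blocked; building it per forward pass trades a little compute for not holding a seq_len x seq_len buffer in memory.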

  • View profile for Nishant Kumar

    Data Engineer @ IBM | 100K+ Audience | • SQL • PySpark • Airflow • AWS • Databricks • Snowflake • Kafka | AWS & Databricks Certified | Scalable Data Pipelines & Data Lakehouse | 650+ Mentorships Delivered

    105,261 followers

This PySpark job was running for 2 hours. I brought it down to 15 mins. And no, I didn't just throw more clusters at it. Here's what really made the difference.

Context: We had a pipeline processing millions of rows: complex joins, multiple transformations, and writing to S3. Every day, it was eating up ~2 hours and slowing down downstream processes.

What I did:

𝐒𝐭𝐞𝐩 1: Avoided shuffles wherever possible
→ Rewrote wide transformations like groupBy and join using efficient partitioning strategies

𝐒𝐭𝐞𝐩 2: Broadcast joins
→ Replaced regular joins with broadcast joins for smaller dimension tables. Saved huge shuffle time

𝐒𝐭𝐞𝐩 3: Used .select() smartly
→ Trimmed down the DataFrame early. No need to carry unused columns throughout

𝐒𝐭𝐞𝐩 4: Cached intermediate DataFrames
→ Especially after expensive operations used multiple times

𝐒𝐭𝐞𝐩 5: Repartitioned before write
→ Controlled file sizes for optimized parallel writes to S3

Result?
- From 2 hours → 15 minutes
- Same data, same cluster, smarter code
- That's the power of PySpark when used right

Have you faced performance issues in Spark jobs too? Drop a "Yes" and I'll share my performance tuning checklist 💡

𝐏𝐫𝐞𝐩𝐚𝐫𝐞 𝐟𝐨𝐫 𝐈𝐧𝐭𝐞𝐫𝐯𝐢𝐞𝐰: https://lnkd.in/gUEVYCGy
𝐉𝐨𝐢𝐧 𝐦𝐞: https://lnkd.in/giE3e9yH

#DataEngineering #PySpark #PerformanceTuning #AWS
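The intuition behind Step 2 can be shown with a toy pure-Python version of a broadcast join. This is not PySpark itself; in PySpark the equivalent is wrapping the small DataFrame in `pyspark.sql.functions.broadcast` before joining. The tables below are invented:

```python
# Toy illustration of why broadcast joins help (plain Python, not PySpark):
# instead of shuffling both sides by key, ship the small dimension table to
# every worker as a hash map and join each partition of the big table locally.
def broadcast_join(big_rows, small_rows, key):
    # "Broadcast": build an in-memory lookup from the small table once.
    lookup = {row[key]: row for row in small_rows}
    # Each partition of the big table can now join with no shuffle at all.
    return [{**row, **lookup[row[key]]} for row in big_rows if row[key] in lookup]

facts = [{"cust_id": 1, "amount": 100}, {"cust_id": 2, "amount": 70},
         {"cust_id": 9, "amount": 5}]
dims = [{"cust_id": 1, "name": "alice"}, {"cust_id": 2, "name": "bob"}]

print(broadcast_join(facts, dims, "cust_id"))
# In PySpark: big_df.join(broadcast(small_df), "cust_id")
```

The saving comes from the fact that only the small side moves over the network, once, instead of both sides being repartitioned by key.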

  • View profile for Arpit Bhayani
Arpit Bhayani is an Influencer
    270,188 followers

Today, I was reading through PostgreSQL's documentation and I stumbled upon something interesting called the `INCLUDE` clause. Let's dig deeper.

The queries you fire on the database will execute the fastest if your database can resolve them just by using indexes. Such indexes are called covering indexes. To achieve this, we almost always add all the columns in the index and create a composite index. For example, an index on `(author, published_at)` on the `blogs` table will be efficient in getting blogs by author ordered by published time.

Given that we will almost always need the `title` of the blog while rendering the list, it makes sense to make it part of the index. However, adding `title` to the index key would make the index less efficient and affect the uniqueness. Essentially, we need the index key to be `(author, published_at)` but also include the `title` column in the index.

This is where PostgreSQL's `INCLUDE` clause comes in pretty handy. The `INCLUDE` clause allows you to add extra columns to the index that are not part of the index's key but are included for efficiency in certain queries. These extra columns make the index covering, and hence some queries can complete their execution without needing to access the actual table.

```sql
CREATE INDEX idx_blogs_author_include
ON blogs (author, published_at)
INCLUDE (title);
```

The above query creates the index that stores `(author, published_at)` for fast lookup and includes `title` so that if a query requires all three, PostgreSQL doesn't have to retrieve the `title` from the main table.

Databases are fascinating, and digging into these details shows the kind of performance we could get from them if we know them inside and out. ⚡

I keep writing and sharing my practical experience and learnings every day, so if you resonate then follow along. I keep it no fluff. youtube.com/c/ArpitBhayani

#AsliEngineering #Databases #PostgreSQL
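SQLite has no `INCLUDE` clause, but the covering-index effect itself can be observed with Python's built-in sqlite3 by appending `title` as a trailing key column; `EXPLAIN QUERY PLAN` reports when the index alone answers the query. The table, index name, and data here are invented for illustration:

```python
# Observing a covering index with sqlite3. SQLite lacks INCLUDE, so title is
# emulated as a trailing key column; the query-plan output shows the effect.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE blogs (author TEXT, published_at TEXT, title TEXT, body TEXT);
    CREATE INDEX idx_blogs_covering ON blogs (author, published_at, title);
    INSERT INTO blogs VALUES ('arpit', '2024-01-01', 'Indexes', '...');
""")

plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT author, published_at, title FROM blogs
    WHERE author = 'arpit' ORDER BY published_at
""").fetchall()
print(plan[0][3])  # the detail text mentions a COVERING INDEX: no table access
```

The difference from PostgreSQL's `INCLUDE` is that here `title` still participates in key comparisons and ordering, which is exactly the overhead `INCLUDE` avoids.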

  • View profile for Kuldeep Singh Sidhu

    Senior Data Scientist @ Walmart | BITS Pilani

    15,203 followers

Excited to share insights from Walmart's groundbreaking semantic search system that revolutionizes e-commerce product discovery!

The team at Walmart Global Technology (the team that I am a part of 😬) has developed a hybrid retrieval system that combines traditional inverted-index search with neural embedding-based search to tackle the challenging problem of tail queries in e-commerce.

Key Technical Highlights:
• The system uses a two-tower BERT architecture where one tower processes queries and another processes product information, generating dense vector representations for semantic matching.
• Product information is enriched by combining titles with key attributes like category, brand, color, and gender, using special prefix tokens to help the model distinguish different attribute types.
• The neural model leverages DistilBERT with 6 layers and projects the 768-dimensional embeddings down to 256 dimensions using a linear layer, achieving optimal performance while reducing storage and computation costs.
• To improve model training, they implemented innovative negative sampling techniques combining product category matching and token overlap filtering to identify challenging negative examples.

Production Implementation Details:
• The system uses a managed ANN (Approximate Nearest Neighbor) service to enable fast retrieval, achieving 99% recall@20 with just 13ms latency.
• Query embeddings are cached with a preset TTL (Time-To-Live) to reduce latency and costs in production.
• The model is exported to ONNX format and served in Java, with custom optimizations like fixed input shapes and GPU acceleration using NVIDIA T4 GPUs.

Results: The system showed significant improvements in both offline metrics and live experiments, with:
- +2.84% improvement in NDCG@10 for human evaluation
- +0.54% lift in Add-to-Cart rates in live A/B testing

This is a fantastic example of how modern NLP techniques can be successfully deployed at scale to solve real-world e-commerce challenges!
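The two-tower retrieval idea can be sketched with a toy example in plain Python, where a bag-of-words vector stands in for each BERT tower and an exact cosine sort stands in for the ANN index. The products, vocabulary, and query are all invented:

```python
# Toy two-tower retrieval: each "tower" maps text to a vector, and products
# are ranked by cosine similarity. Real systems use trained BERT towers and
# an approximate-nearest-neighbor index instead of this exact sort.
import math
from collections import Counter

VOCAB = sorted(set(
    "red running shoes women blue denim jacket men shorts womens for".split()))

def encode(text):
    # Stand-in tower: bag-of-words over a fixed vocabulary, L2-normalized.
    toks = Counter(text.lower().split())
    vec = [float(toks[w]) for w in VOCAB]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    # Vectors are unit-length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

products = ["red running shoes women", "blue denim jacket men", "running shorts"]
index = [(p, encode(p)) for p in products]        # "product tower" embeddings

query_vec = encode("womens red shoes for running")  # "query tower"
ranked = sorted(index, key=lambda pe: -cosine(query_vec, pe[1]))
print(ranked[0][0])  # red running shoes women
```

The structure is the same as the production system: embed offline, embed the query online, rank by vector similarity; the towers and the index are just far more sophisticated.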

  • View profile for Damien Benveniste, PhD
Damien Benveniste, PhD is an Influencer

    Building AI Agents

    173,245 followers

Quantizing is not enough when fine-tuning a model! Even at the lowest precisions, most of the memory is going to be taken by the optimizer state when training that model.

One great strategy that emerged recently is QLoRA. The idea is to apply LoRA adapters to quantized models. The optimizer state is then computed only for the adapter parameters instead of the whole model, and this saves a large amount of memory!

The parameters are converted from BFloat16 / Float16 to a 4-bit NormalFloat. This quantization strategy comes from the realization that trained model weights tend to be normally distributed, and we can create quantization buckets using that fact. This allows the compression of the model parameters without too much information loss.

When we quantize a model, we need to keep the quantization constants to be able to dequantize it. We usually keep them in Float32 to avoid as much dequantization error as possible. To compress the model further, we perform a double quantization that quantizes the quantization constants themselves to Float8.

During the forward pass, because the input tensors are in BFloat16 / Float16, we need to dequantize the quantized parameters to perform the operations. However, during the backward pass, no gradients are computed for the original weights, so they can remain in their quantized form.
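The bucket idea behind 4-bit quantization can be sketched in plain Python. The codebook and block size below are illustrative, not the exact NF4 levels:

```python
# Toy version of the bucketing behind 4-bit NormalFloat: store each weight as
# the index of the nearest codebook level, plus one scale constant per block.
import random

LEVELS = [-1.0, -0.7, -0.45, -0.25, -0.1, 0.0, 0.1, 0.25, 0.45, 0.7, 1.0]

def quantize_block(weights):
    scale = max(abs(w) for w in weights) or 1.0        # the quantization constant
    idxs = [min(range(len(LEVELS)), key=lambda i: abs(w / scale - LEVELS[i]))
            for w in weights]
    return idxs, scale                                  # small ints + one float

def dequantize_block(idxs, scale):
    return [LEVELS[i] * scale for i in idxs]

random.seed(0)
weights = [random.gauss(0, 0.02) for _ in range(16)]   # roughly normal weights
idxs, scale = quantize_block(weights)
restored = dequantize_block(idxs, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max reconstruction error: {max_err:.4f}")
```

Because the levels are denser near zero, normally distributed weights land close to a level most of the time; double quantization would additionally compress the per-block `scale` values themselves.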
