Hong Chen - Facebook | LinkedIn
- Yizhe Zhang Apple • 3K followers

We (w/ Shansan Gong, Ruixiang ZHANG, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong) released DiffuCoder, a family of 7B diffusion language models that specializes in code generation, with a focus on understanding and improving masked diffusion models.

At the core of DiffuCoder's analysis is the autoregressiveness (AR-ness) score, a novel metric that quantifies causal patterns in decoding, revealing how diffusion models break from strict left-to-right generation in favor of more flexible, non-linear code planning.

Autoregressive (AR) models currently dominate code generation, but diffusion-based LLMs (dLLMs) like DiffuCoder offer a promising alternative, especially for complex programming tasks. DiffuCoder explores how these models decode differently, showing less global AR-ness on code tasks than on math, and how temperature affects both token selection and generation order, unlike in traditional AR models.

We also introduce coupled-GRPO, a post-training RL method with a coupled-sampling scheme that reduces performance drops during accelerated decoding, boosting parallelism and efficiency. A self-improvement pipeline combines AR-ness analysis, coupled-GRPO optimization, and evaluation on benchmarks like AceCode-89k to refine decoding strategies. This approach enables DiffuCoder to navigate diverse code generation pathways and improve performance with modest computational overhead.

Looking ahead, we aim to further leverage reinforcement learning to steer code generation through these decoding patterns; the discrete nature of AR-ness scores provides a foundation for search-based strategies, a good fit for the sparse rewards of optimizing complex code structures.

Check out our full paper and code for a deeper dive!
Paper: https://lnkd.in/gVWU3BDJ
Code: https://lnkd.in/gmXTZ_6n
Models: https://lnkd.in/gTcKCDr9

#MachineLearning #AI #CodeGeneration #DiffusionModels #NLP
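The AR-ness idea, quantifying how often decoding behaves left-to-right, can be illustrated with a toy metric. This is a simplified sketch only; `sequential_ar_ness` and its exact definition here are illustrative, not the paper's formulation, which covers both local and global decoding patterns.

```python
def sequential_ar_ness(decode_order):
    """Toy AR-ness score: the fraction of decoding steps that fill the
    leftmost still-masked position, i.e. behave autoregressively.
    decode_order lists token positions in the order they were unmasked.
    (Illustrative metric only, not the DiffuCoder paper's definition.)"""
    masked = set(range(len(decode_order)))
    ar_steps = 0
    for pos in decode_order:
        if pos == min(masked):  # this step filled the leftmost remaining slot
            ar_steps += 1
        masked.remove(pos)
    return ar_steps / len(decode_order)

# Strict left-to-right decoding is maximally autoregressive:
print(sequential_ar_ness([0, 1, 2, 3]))  # 1.0
# Out-of-order decoding, as diffusion models allow, lowers the score:
print(sequential_ar_ness([3, 0, 2, 1]))  # 0.5
```

A diffusion decoder with a low score under a metric like this is planning tokens non-linearly rather than strictly appending.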
- Antonio Mallia Seltz • 4K followers

⚡ Exciting to see our Block-Max Pruning (BMP) technique in Infinity, an open-source AI-native database designed for LLM applications!

In their latest VLDB paper, “Balancing the Blend: An Experimental Analysis of Trade-offs in Hybrid Search”, Hai Jin, Yingfeng Zhang, and co-authors present a rigorous evaluation of hybrid search architectures combining full-text, sparse, dense, and tensor retrieval. To support efficient sparse vector search at scale, they’ve integrated BMP into Infinity’s SVS engine, a nice validation of our work on fast top-k lexical retrieval.

🔗 BMP paper: https://lnkd.in/dsc33hGc
🔗 BMP code: https://lnkd.in/dxBxv225
🔗 Infinity: https://lnkd.in/ddRK5mbr
🔗 Hybrid Search paper: https://lnkd.in/dfBuDXmt

Great to see ideas from traditional IR continuing to shape the next generation of retrieval infrastructure!
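The core intuition behind block-max pruning, skipping any block whose precomputed maximum score cannot beat the current top-k threshold, can be sketched in a few lines. `block_max_topk` below is a toy illustration over a plain score array, not the actual BMP implementation, which operates on compressed posting lists with much more machinery.

```python
import heapq

def block_max_topk(scores, k, block_size=4):
    """Toy block-max pruning for top-k retrieval: precompute each block's
    maximum score, then skip whole blocks whose upper bound cannot enter
    the current top-k. (Illustration of the idea only.)"""
    blocks = [scores[i:i + block_size] for i in range(0, len(scores), block_size)]
    block_max = [max(b) for b in blocks]  # per-block upper bounds
    heap = []  # min-heap of (score, doc_id), capped at size k
    for bi, block in enumerate(blocks):
        threshold = heap[0][0] if len(heap) == k else float("-inf")
        if block_max[bi] <= threshold:
            continue  # entire block pruned without scoring its documents
        for j, score in enumerate(block):
            doc_id = bi * block_size + j
            if len(heap) < k:
                heapq.heappush(heap, (score, doc_id))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)  # (score, doc_id), best first
```

The pruning condition is exact, so the returned top-k matches an exhaustive scan; the win is that low-scoring blocks are never touched.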
- Emilio Andere Wafer • 14K followers

Here are some simple habits that make your GPU profiling more useful when making AI models run faster.

Profilers (e.g. Nsight Compute) report tiny timing/counter diffs, so run-to-run variance, different inputs, or host jitter can swamp the signal. If you profile different shapes/inputs each time, or switch between your macOS viewer and a remote Linux GPU box without locking inputs, you’ll “see” a lot of changes that are just noise.

So, here’s your 5-minute checklist for your next profiling run:

1. Freeze the workload
   a. Choose one representative input/shape (e.g., batch=16, seq=4096, hidden=8192).
   b. Fix seeds and flags: CUDA_LAUNCH_BLOCKING=0, torch.manual_seed(1337), np.random.seed(1337).
   c. Disable other sources of randomness (dropout, sampling).
2. Pin the environment
   a. Same container image, driver, CUDA toolkit, and clocks.
   b. Make sure the remote GPU box is otherwise idle.
3. Script it
   a. One command that: warms up → runs the golden case → emits an NCU report.
4. Name your reports by hypothesis
   a. golden_baseline.ncu-rep, golden_prefetchL2.ncu-rep, golden_unroll4.ncu-rep.
   b. Compare only golden vs. golden; don’t mix shapes.
5. Use stable metrics
   a. Prefer counters less sensitive to jitter: achieved occupancy, eligible warps per cycle, DRAM throughput, L2 hit rate, stall breakdowns, executed instructions.
   b. Treat small wall-time changes (<2–3%) as noise until multiple runs agree.

#GPUProfiling #NsightCompute #CUDA #PerformanceEngineering
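Step 1 of the checklist (freezing the workload) can be scripted once and reused before every profiling run. A minimal sketch, assuming a PyTorch/NumPy workload; `freeze_workload` is a hypothetical helper name, and the cuBLAS env var is an assumption that only matters on CUDA ≥ 10.2:

```python
import os
import random

import numpy as np

def freeze_workload(seed=1337):
    """Pin every source of randomness before a profiling run so that
    run-to-run diffs reflect code changes, not noise.
    (Hypothetical helper; adapt to your framework.)"""
    random.seed(seed)
    np.random.seed(seed)
    # Assumption: deterministic cuBLAS workspaces, relevant on CUDA >= 10.2.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    try:
        import torch
        torch.manual_seed(seed)                      # seeds CPU and CUDA RNGs
        torch.use_deterministic_algorithms(True)     # fail loudly on nondeterministic ops
    except ImportError:
        pass  # non-PyTorch workload: the NumPy/stdlib seeding above still applies
```

Call it at the top of the one script from step 3, so the warm-up, the golden case, and the NCU report all see the same frozen inputs.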
- Arham Mehta NVIDIA • 6K followers

Excited about the release of Nemotron 3 Super: 120B total / 12B active parameters, a Hybrid SSM Latent MoE, designed for Blackwell, pre-trained in NVFP4, and the top spot on the AA index for its size!

For data at scale, we used NeMo Curator to curate over 10 trillion tokens for LLM pre-training, and open-sourced our data curation recipe, which includes GPU-accelerated deduplication, quality filtering, model ensembling, and more, so you can use the same pipeline for your own workflows.

🔗 NeMo Curator: open-source recipe and code so you can curate high-quality datasets for your use case: https://lnkd.in/ggykjqa2
🔗 Technical report: https://lnkd.in/gmxhBK89

Mostofa Patwary, Markus Kliegl, Ayush Dattagupta, Vibhu Jawa, Abhinav Garg, Praateek Mahajan, Sarah Yurick, Bartley Richardson, Randy Gelhausen, Ashwath Aithal, Nima Tajbakhsh

#AI #dataprocessing #datacuration #NVIDIA #OpenSource #AIdeveloper
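The simplest stage of a deduplication pipeline like the one described, exact dedup by content hash, can be sketched as follows. `exact_dedup` is illustrative only; NeMo Curator's actual GPU-accelerated fuzzy and semantic dedup stages are far more involved.

```python
import hashlib

def exact_dedup(docs):
    """Toy exact deduplication: keep the first occurrence of each
    document, keyed by a hash of its normalized text.
    (Illustrative sketch, not NeMo Curator's implementation.)"""
    seen, kept = set(), []
    for doc in docs:
        # Light normalization so trivially different copies collide.
        key = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept
```

Real pipelines follow this with fuzzy dedup (e.g. MinHash-style signatures) to catch near-duplicates that exact hashing misses.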
- Michael Diarra CPA, CISA, CDPSE 7K followers In this groundbreaking paper published on December 11, 2025, Delong Chen and a team from Meta FAIR dismantle the "next-token" guessing game by introducing VL-JEPA, a radical departure from the architecture of traditional AI. Unlike ChatGPT, Claude, or Gemini, which rely on computationally heavy autoregressive generation, this model predicts continuous semantic embeddings in a latent space to understand the world without needing to generate every single pixel or word. This shift is transformative because it enables deeper multimodal reasoning with 50% fewer parameters, proving that the future of intelligence lies in efficient world modeling rather than just imitating surface-level linguistic patterns. #MachineLearning #MetaAI #VLJEPA #GenerativeAI #ComputerVision #AIResearch #WorldModels #FutureOfAI
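The JEPA-style objective described, regressing predicted embeddings in latent space rather than generating every token, can be sketched schematically. `jepa_latent_loss` and the linear predictor below are illustrative stand-ins, not VL-JEPA's actual encoders or predictor.

```python
import numpy as np

def jepa_latent_loss(context_emb, target_emb, W):
    """Toy JEPA-style objective: predict the target's *embedding* from
    the context embedding and score with a regression loss in latent
    space, with no token-by-token generation involved.
    (Schematic only; a linear map stands in for a real predictor network.)"""
    pred = context_emb @ W                      # predicted target embedding
    return float(np.mean((pred - target_emb) ** 2))
```

The contrast with autoregressive models is that the loss lives entirely in embedding space: there is no softmax over a vocabulary and no pixel reconstruction, which is where the efficiency claim comes from.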
- Gunaputra Nagendra Pavan Yedida Discensys • 5K followers I just read “Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs”, which extends Chinchilla-style laws to include architecture choices like hidden size, MLP/attention balance, and GQA, then uses them to find designs that are both cheaper and more accurate than LLaMA-3.2 under the same training budget. 🔗 https://lnkd.in/gYnqdc6R Insights from the paper: 🔹 Conditional scaling law: Augments classic compute-optimal scaling with architectural variables (hidden size, mlp-to-attention ratio, GQA groups) so loss becomes a function of both scale and shape, not just total params/tokens. 🔹 Architecture–efficiency tradeoffs: Training 200+ models (80M–3B, 8B–100B tokens) shows that shifting parameters from MLP to attention, and using well-tuned GQA, can significantly improve accuracy at the same FLOPs and memory footprint. 🔹 Search framework: Fits the conditional law, then searches over design choices to predict Pareto-optimal points, yielding models with up to 2.1% higher accuracy and 42% higher inference throughput than LLaMA-3.2 for the same training compute. 🔹 Practical takeaway: Instead of “Chinchilla but bigger,” the work argues for architecture-aware scaling as a new axis—optimizing how parameters are wired can matter as much as how many you have. This is a good resource if you’re designing custom LLMs, worrying about serving costs, or exploring how scaling laws and architecture search can be combined for inference-efficient models. #AI #MachineLearning #ScalingLaws #LLM #ModelArchitecture #InferenceEfficiency
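A conditional scaling law of the kind described can be sketched as a Chinchilla-style loss plus a shape term, then searched over the architecture axis at fixed scale. All coefficient values below (including the "optimal" MLP-to-attention ratio) are made up for illustration, not the paper's fitted values.

```python
import math

def predicted_loss(n_params, n_tokens, mlp_attn_ratio,
                   E=1.7, A=400.0, alpha=0.34, B=410.0, beta=0.28,
                   r_opt=3.0, gamma=0.01):
    """Toy conditional scaling law: classic compute-optimal form plus a
    shape penalty for deviating from an (assumed) optimal MLP-to-attention
    parameter ratio. All coefficients here are illustrative."""
    scale = E + A / n_params**alpha + B / n_tokens**beta
    shape = gamma * math.log(mlp_attn_ratio / r_opt) ** 2
    return scale + shape

def best_shape(n_params, n_tokens, ratios=(1.0, 2.0, 3.0, 4.0, 6.0)):
    # Search the architecture axis while holding params and tokens fixed,
    # mimicking the paper's "same training budget, better shape" search.
    return min(ratios, key=lambda r: predicted_loss(n_params, n_tokens, r))
```

Once the law is fitted on small models, this kind of search replaces training hundreds of candidates: you evaluate the closed-form predictor over the design grid and only train the predicted Pareto-optimal points.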