airllm

Library

lyogavin/airllm

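Run 70B large language models on a single GPU with 4GB of VRAM, without quantization.

Project Overview

AirLLM optimizes inference memory usage so that large language models such as 70B models can run on a single 4GB GPU without quantization, distillation, or pruning. It also supports running Llama 3.1 405B on 8GB of VRAM and offers block-wise quantization to speed up inference.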

README Preview

[**Quickstart**](#quickstart) | [**Configurations**](#configurations) | [**MacOS**](#macos) | [**Example notebooks**](#example-python-notebook) | [**FAQ**](#faq)

**AirLLM** optimizes inference memory usage, allowing 70B large language models to run inference on a single 4GB GPU card without quantization, distillation and pruning. And you can run **405B Llama3.1** on **8GB vram** now.

## AI Agents Recommendation:

* [Best AI Game Sprite Generator](https://godmodeai.co)

* [Best AI Facial Expression Editor](https://crazyfaceai.com)

## Updates
[2024/08/20] v2.11.0: Support Qwen2.5

[2024/08/18] v2.10.1 Support CPU inference. Support non sharded models. Thanks @NavodPeiris for the great work!

[2024/07/30] Support Llama3.1 **405B** ([example notebook](https://colab.research.google.com/github/lyogavin/airllm/blob/main/air_llm/examples/run_llama3.1_405B.ipynb)). Support **8bit/4bit quantization**.

[2024/04/20] AirLLM supports Llama3 natively already. Run Llama3 70B on 4GB single GPU.

[2023/12/25] v2.8.2: Support MacOS running 70B large language models.

[2023/12/20] v2.7: Support AirLLMMixtral.

[2023/12/20] v2.6: Added AutoModel, automatically detect model type, no need to provide model class to initialize model.

[2023/12/18] v2.5: added prefetching to overlap the model loading and compute. 10% speed improvement.

[2023/12/03] added support of **ChatGLM**, **QWen**, **Baichuan**, **Mistral**, **InternLM**!

[2023/12/02] added support for safetensors. Now support all top 10 models in open llm leaderboard.

[2023/12/01] airllm 2.0. Support com

FAQ (1)

Troubleshooting

How do I fix the "ValueError: Cannot index mlx array using the given type" error when running a Llama model with airllm on a Mac?

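This error occurs when running airllm with the MLX backend on Apple Silicon: the model expects an MLX array, but the input token IDs are passed in as a PyTorch tensor. The fix is to convert the input tokens before generation. Use `import mlx.core as mx; generation_output = model.generate(mx.array(input_tokens['input_ids']))` instead of `input_tokens['input_ids'].cuda()`. This keeps the input compatible with the MLX embedding layer. The issue was reported with airllm v2.9.1, Python 3.12.4, mlx versions 14.1–16.1, and macOS 14.5 on an M1 Pro.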

Source: Issue #167
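The conversion described in the answer above can be wrapped in small helpers. This is a minimal sketch, not airllm's own code: the helper names `ids_to_nested_list`, `to_mlx_input`, and `generate_on_mac` are hypothetical, `model` is assumed to be an already loaded airllm model, and `input_tokens` is assumed to be tokenizer output with a 2-D `input_ids` entry of shape (batch, sequence). Only `mx.array` and `model.generate` come from the fix itself.

```python
def ids_to_nested_list(input_ids):
    """Return token IDs as a nested list of Python ints.

    Accepts anything exposing .tolist() (for example a CPU PyTorch tensor
    or a NumPy array) or an already nested list/tuple. The input is
    assumed to be 2-D: (batch, sequence).
    """
    if hasattr(input_ids, "tolist"):
        input_ids = input_ids.tolist()
    return [[int(token) for token in row] for row in input_ids]


def to_mlx_input(input_ids):
    """Wrap token IDs in an MLX array (needs the mlx package on Apple Silicon)."""
    # Imported lazily so ids_to_nested_list stays usable where mlx is absent.
    import mlx.core as mx

    return mx.array(ids_to_nested_list(input_ids))


def generate_on_mac(model, input_tokens, **generate_kwargs):
    """Call model.generate with an MLX array instead of input_ids.cuda().

    `model` and `input_tokens` are placeholders supplied by the caller;
    any keyword arguments are forwarded to model.generate unchanged.
    """
    return model.generate(to_mlx_input(input_tokens["input_ids"]), **generate_kwargs)
```

Going through a plain nested list avoids relying on `mx.array` accepting a particular tensor type directly, at the cost of one extra copy of the (small) token ID batch.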

