Added xformers support to Llama (#950)

2023-04-09 22:08:40 -04:00
parent 625d81f495
commit 992663fa20
4 changed files with 185 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -215,6 +215,8 @@ Optionally, you can use the following command-line flags:
 | `--load-in-8bit`                            | Load the model with 8-bit precision.|
 | `--bf16`                                    | Load the model with bfloat16 precision. Requires NVIDIA Ampere GPU. |
 | `--no-cache`                                | Set `use_cache` to False while generating text. This reduces the VRAM usage a bit with a performance cost. |
+| `--xformers`                                | Use xformers' memory-efficient attention. This should increase your tokens/s. |
+| `--sdp-attention`                           | Use PyTorch 2.0's scaled dot-product (SDP) attention. |

 #### llama.cpp