
Ferret: An End-to-End MLLM by Apple

Ferret: Refer and Ground Anything Anywhere at Any Granularity

An End-to-End MLLM that Accepts Any-Form Referring and Grounds Anything in Response. [Paper]

Haoxuan You *, Haotian Zhang *, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, Yinfei Yang
[*: equal contribution]


Diagram of Ferret Model.

Key Contributions:

  • Ferret Model – Hybrid Region Representation + Spatial-aware Visual Sampler enable fine-grained and open-vocabulary referring and grounding in MLLM.
  • GRIT Dataset (~1.1M) – A Large-scale, Hierarchical, Robust ground-and-refer instruction tuning dataset.
  • Ferret-Bench – A multimodal evaluation benchmark that jointly requires Referring/Grounding, Semantics, Knowledge, and Reasoning.


  • [12/14] We released the checkpoints (7B, 13B).
  • [10/30] We released the code of the FERRET model and Ferret-Bench.

Usage and License Notices: The data and code are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, Vicuna, and GPT-4. The dataset is CC BY NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.


  • Install
  • Train
  • Evaluation
  • Demo

Install

  1. Clone this repository and navigate to the FERRET folder

git clone https://github.com/apple/ml-ferret
cd ml-ferret

  2. Install Package

conda create -n ferret python=3.10 -y
conda activate ferret
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install pycocotools
pip install protobuf==3.20.0
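
Optionally, you can sanity-check the install afterwards. This is only a sketch: it assumes the editable install exposes a ferret Python package, as suggested by the ferret.model.apply_delta invocation later in this README.

# Hypothetical sanity check: confirm the ferret package and pycocotools import cleanly.
python -c "import ferret, pycocotools; print('install OK')"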

  3. Install additional packages for training cases

pip install ninja
pip install flash-attn --no-build-isolation

Train

FERRET is trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the per_device_train_batch_size and increase the gradient_accumulation_steps accordingly. Always keep the global batch size the same: per_device_train_batch_size x gradient_accumulation_steps x num_gpus.
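
For illustration, a minimal shell sketch of that arithmetic; the GPU count, per-device batch size, and accumulation steps below are hypothetical values, not recommended settings:

# Hypothetical example: keep the global batch size at 128 when training on 4 GPUs.
NUM_GPUS=4
PER_DEVICE_TRAIN_BATCH_SIZE=4
GRADIENT_ACCUMULATION_STEPS=8
echo "global batch size = $((PER_DEVICE_TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS * NUM_GPUS))"  # prints 128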


We use a similar set of hyperparameters as LLaVA (Vicuna) in finetuning.

Hyperparameter   Global Batch Size   Learning rate   Epochs   Max length   Weight decay
FERRET-7B        128                 2e-5            3        2048         0
FERRET-13B       128                 2e-5            3        2048         0

Prepare Vicuna checkpoint and LLaVA’s projector

Before you begin, prepare our base model Vicuna, which is an instruction-tuned chatbot. Please download its weights following the instructions here. Vicuna v1.3 is used in FERRET.
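
For illustration only, one possible way to fetch the 7B weights, assuming the merged Vicuna v1.3 weights published as lmsys/vicuna-7b-v1.3 on Hugging Face and the huggingface_hub CLI; the local directory matches the path expected by the apply_delta command later in this README. Follow the official instructions linked above for the supported procedure.

# Hypothetical download path; adjust to however you obtain the Vicuna v1.3 weights.
pip install -U "huggingface_hub[cli]"
huggingface-cli download lmsys/vicuna-7b-v1.3 --local-dir ./model/vicuna-7b-v1-3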

Download LLaVA’s first-stage pre-trained projector weight (7B, 13B).

FERRET Training

The scripts are provided (7B, 13B).


Evaluation

Please see this doc for the details.


Checkpoints

We extracted the delta between our pre-trained model and Vicuna. Please first download the weights of Vicuna following the previous instructions. Download our prepared offsets of weights: 7B, 13B using wget or curl, and unzip the downloaded offsets. Apply the offset to the Vicuna’s weights by running the following script:

# 7B
python3 -m ferret.model.apply_delta \
    --base ./model/vicuna-7b-v1-3 \
    --target ./model/ferret-7b-v1-3 \
    --delta path/to/ferret-7b-delta
# 13B
python3 -m ferret.model.apply_delta \
    --base ./model/vicuna-13b-v1-3 \
    --target ./model/ferret-13b-v1-3 \
    --delta path/to/ferret-13b-delta

Notices: Apple’s rights in the attached weight differentials are hereby licensed under the CC-BY-NC license.
