Paper: Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup (arXiv:2101.06983)
This is a sentence-transformers model fine-tuned from Parveshiiii/Embedding on the trivia-qa-triplet dataset. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
(0): Transformer({'max_seq_length': 1024, 'do_lower_case': False, 'architecture': 'XLMRobertaModel'})
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': False})
)
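The Pooling module above averages token embeddings over the sequence (mean pooling), skipping padding positions. A minimal NumPy sketch of that step, with toy shapes rather than the model's real 1024 dimensions:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings, counting only non-padding positions.

    token_embeddings: (batch, seq_len, dim)
    attention_mask:   (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid division by zero
    return summed / counts

# Toy example: batch of 1, seq_len 3 (last token is padding), dim 2
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0]])
print(mean_pool(tokens, mask))  # [[2. 3.]] — the padded token is ignored
```

Because padding is masked out, the sentence embedding is independent of how much the batch was padded.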
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
'The weather is lovely today.',
"It's so sunny outside!",
'He drove to the stadium.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.7635, 0.5915],
# [0.7635, 1.0000, 0.6165],
# [0.5915, 0.6165, 1.0000]])
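For this model, `model.similarity` computes cosine similarity between embeddings. A minimal NumPy sketch of the equivalent computation (the function name here is illustrative, not the library's internals):

```python
import numpy as np

def cos_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a and rows of b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

embeddings = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
sims = cos_sim(embeddings, embeddings)
print(sims.round(4))
# The diagonal is 1.0: every vector is maximally similar to itself,
# matching the diagonal of the tensor printed above.
```

Since the rows are L2-normalized first, the matrix product directly yields cosine scores in [-1, 1].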
Loss: CachedMultipleNegativesRankingLoss over (anchor, positive, negative) triplets, with these parameters:
{
"scale": 20.0,
"similarity_fct": "cos_sim",
"mini_batch_size": 512,
"gather_across_devices": false,
"directions": [
"query_to_doc"
],
"partition_mode": "joint",
"hardness_mode": null,
"hardness_strength": 0.0
}
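At its core, this loss is in-batch-negatives cross-entropy: each anchor should score its own positive (the diagonal of the similarity matrix) higher than every other positive in the batch, with cosine scores multiplied by scale=20. A minimal NumPy sketch of that objective, leaving out the gradient-caching trick that makes batch size 4096 fit in memory:

```python
import numpy as np

def mnrl_loss(anchors: np.ndarray, positives: np.ndarray, scale: float = 20.0) -> float:
    """Multiple negatives ranking loss: cross-entropy over scaled cosine scores,
    where the correct "class" for anchor i is positive i (the diagonal)."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    scores = scale * (a @ p.T)                    # (batch, batch) similarity matrix
    scores -= scores.max(axis=1, keepdims=True)   # stabilize the softmax
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # NLL of the diagonal entries

# Correctly aligned anchor/positive pairs should give a much lower loss
# than randomly paired ones
rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))
aligned = anchors + 0.01 * rng.normal(size=(8, 16))
print(mnrl_loss(anchors, aligned) < mnrl_loss(anchors, rng.normal(size=(8, 16))))  # True
```

The cached variant computes the same loss but back-propagates through `mini_batch_size=512` chunks at a time, trading compute for memory, per the gradient-caching paper cited below.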
Non-default hyperparameters:
- per_device_train_batch_size: 4096
- max_steps: 3230
- learning_rate: 2e-05
- warmup_steps: 100
- optim: adamw_torch_fused
- bf16: True
- gradient_checkpointing: True
- accelerator_config: {'split_batches': True, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}

All hyperparameters:
- per_device_train_batch_size: 4096
- num_train_epochs: 3.0
- max_steps: 3230
- learning_rate: 2e-05
- lr_scheduler_type: linear
- lr_scheduler_kwargs: None
- warmup_steps: 100
- optim: adamw_torch_fused
- optim_args: None
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- optim_target_modules: None
- gradient_accumulation_steps: 1
- average_tokens_across_devices: True
- max_grad_norm: 1.0
- label_smoothing_factor: 0.0
- bf16: True
- fp16: False
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- gradient_checkpointing: True
- gradient_checkpointing_kwargs: None
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- use_liger_kernel: False
- liger_kernel_config: None
- use_cache: False
- neftune_noise_alpha: None
- torch_empty_cache_steps: None
- auto_find_batch_size: False
- log_on_each_node: True
- logging_nan_inf_filter: True
- include_num_input_tokens_seen: no
- log_level: passive
- log_level_replica: warning
- disable_tqdm: False
- project: huggingface
- trackio_space_id: trackio
- eval_strategy: no
- per_device_eval_batch_size: 8
- prediction_loss_only: True
- eval_on_start: False
- eval_do_concat_batches: True
- eval_use_gather_object: False
- eval_accumulation_steps: None
- include_for_metrics: []
- batch_eval_metrics: False
- save_only_model: False
- save_on_each_node: False
- enable_jit_checkpoint: False
- push_to_hub: False
- hub_private_repo: None
- hub_model_id: None
- hub_strategy: every_save
- hub_always_push: False
- hub_revision: None
- load_best_model_at_end: False
- ignore_data_skip: False
- restore_callback_states_from_checkpoint: False
- full_determinism: False
- seed: 42
- data_seed: None
- use_cpu: False
- accelerator_config: {'split_batches': True, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- parallelism_config: None
- dataloader_drop_last: True
- dataloader_num_workers: 0
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- dataloader_prefetch_factor: None
- remove_unused_columns: True
- label_names: None
- train_sampling_strategy: random
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- ddp_backend: None
- ddp_timeout: 1800
- fsdp: []
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- deepspeed: None
- debug: []
- skip_memory_metrics: True
- do_predict: False
- resume_from_checkpoint: None
- warmup_ratio: None
- local_rank: -1
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: proportional
- router_mapping: {}
- learning_rate_mapping: {}

Training logs:
| Epoch | Step | Training Loss |
|---|---|---|
| 0.0031 | 10 | 7.6224 |
| 0.0062 | 20 | 7.6213 |
| 0.0093 | 30 | 7.6198 |
| 0.0124 | 40 | 7.6083 |
| 0.0155 | 50 | 7.5609 |
| 0.0186 | 60 | 7.4522 |
| 0.0217 | 70 | 7.3751 |
| 0.0248 | 80 | 7.3387 |
| 0.0279 | 90 | 7.3863 |
| 0.0310 | 100 | 7.2047 |
| 0.0341 | 110 | 7.2849 |
| 0.0372 | 120 | 7.2857 |
| 0.0402 | 130 | 7.3164 |
| 0.0433 | 140 | 7.2506 |
| 0.0464 | 150 | 7.4432 |
| 0.0495 | 160 | 7.2519 |
| 0.0526 | 170 | 7.3358 |
| 0.0557 | 180 | 7.2496 |
| 0.0588 | 190 | 7.3306 |
| 0.0619 | 200 | 7.2377 |
| 0.0650 | 210 | 7.2976 |
| 0.0681 | 220 | 7.2039 |
| 0.0712 | 230 | 7.1852 |
| 0.0743 | 240 | 7.2373 |
| 0.0774 | 250 | 7.2902 |
| 0.0805 | 260 | 7.2145 |
| 0.0836 | 270 | 7.2598 |
| 0.0867 | 280 | 7.3147 |
| 0.0898 | 290 | 7.1940 |
| 0.0929 | 300 | 7.2009 |
| 0.0960 | 310 | 7.2074 |
| 0.0991 | 320 | 7.3131 |
| 0.1022 | 330 | 7.2124 |
| 0.1053 | 340 | 7.1579 |
| 0.1084 | 350 | 7.1688 |
| 0.1115 | 360 | 7.2484 |
| 0.1146 | 370 | 7.2506 |
| 0.1176 | 380 | 7.1243 |
| 0.1207 | 390 | 7.2264 |
| 0.1238 | 400 | 7.3368 |
| 0.1269 | 410 | 7.3014 |
| 0.1300 | 420 | 7.2524 |
| 0.1331 | 430 | 7.0409 |
| 0.1362 | 440 | 7.1438 |
| 0.1393 | 450 | 7.2448 |
| 0.1424 | 460 | 7.2018 |
| 0.1455 | 470 | 7.2354 |
| 0.1486 | 480 | 7.2031 |
| 0.1517 | 490 | 7.2163 |
| 0.1548 | 500 | 7.1130 |
| 0.1579 | 510 | 7.1783 |
| 0.1610 | 520 | 7.1934 |
| 0.1641 | 530 | 7.1669 |
| 0.1672 | 540 | 7.1286 |
| 0.1703 | 550 | 7.1773 |
| 0.1734 | 560 | 7.2205 |
| 0.1765 | 570 | 7.0962 |
| 0.1796 | 580 | 7.3322 |
| 0.1827 | 590 | 7.1580 |
| 0.1858 | 600 | 7.0881 |
| 0.1889 | 610 | 7.1334 |
| 0.1920 | 620 | 7.0562 |
| 0.1950 | 630 | 7.2170 |
| 0.1981 | 640 | 7.1307 |
| 0.2012 | 650 | 7.1279 |
| 0.2043 | 660 | 7.0545 |
| 0.2074 | 670 | 7.2590 |
| 0.2105 | 680 | 7.1954 |
| 0.2136 | 690 | 7.0225 |
| 0.2167 | 700 | 7.1797 |
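The non-default hyperparameters above correspond to a trainer setup roughly like the following. This is a hedged sketch using the Sentence Transformers v3+ training API; the output directory and dataset identifier are placeholders, and defaults (seed, scheduler, accelerator config, etc.) are omitted:

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

model = SentenceTransformer("Parveshiiii/Embedding")  # the base model named above

# Gradient caching makes a 4096 batch feasible by back-propagating
# through mini-batches of 512 pairs at a time
loss = CachedMultipleNegativesRankingLoss(model, scale=20.0, mini_batch_size=512)

args = SentenceTransformerTrainingArguments(
    output_dir="output",                  # placeholder
    per_device_train_batch_size=4096,
    max_steps=3230,
    learning_rate=2e-5,
    warmup_steps=100,
    optim="adamw_torch_fused",
    bf16=True,
    gradient_checkpointing=True,
)

train_dataset = load_dataset("trivia-qa-triplet", split="train")  # dataset id is a placeholder

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```

With triplet columns (anchor, positive, negative), the trainer feeds all three to the loss; the explicit negatives are used alongside the in-batch negatives.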
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{gao2021scaling,
title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
year={2021},
eprint={2101.06983},
archivePrefix={arXiv},
primaryClass={cs.LG}
}