Can you release the sharegpt dataset? #90

Closed
LZY-the-boys opened this issue Mar 31, 2023 · 26 comments
Labels
question (Further information is requested)

Comments

@LZY-the-boys

I am wondering whether the ShareGPT data can be released.

@ari9dam

ari9dam commented Mar 31, 2023

If the data can't be released, could you please share the code for crawling the dataset and all the processing you did to convert the HTML to markdown?

@MarkSchmidty

MarkSchmidty commented Mar 31, 2023

Up until two days ago, ShareGPT had an explore page that could easily be scraped. They removed that page to prevent scraping.

@Kreijstal

"Open-Source"

  • No Weights
  • No Dataset
  • No checkpoints

That's not open source. Not at all, don't claim to be open.

@merrymercy
Member

merrymercy commented Mar 31, 2023

Hi @Kreijstal, @LZY-the-boys and @ari9dam

Thanks for your interest! We plan to release the weights once we have addressed all concerns and have a low-resource version of the inference code ready. We released the demo first to get some early feedback on the model.

We have no current plans to release the dataset and will first communicate with the ShareGPT team.

The data cleaning script is here:

"""
Usage: python3 -m fastchat.data.clean_sharegpt --in sharegpt_html.json --out sharegpt_clean.json
"""

@timatom

timatom commented Mar 31, 2023

@merrymercy,

Regarding the dataset, is the decision not to release it out of respect for the ShareGPT team, who disabled their endpoint? My understanding is that the endpoint was disabled for security reasons, which I can respect.

If so, do you know of any efforts to build public datasets for training foundation models like Vicuna? If not, do you know of any resources that could help others interested in such efforts?

@Jeffwan
Contributor

Jeffwan commented Apr 1, 2023

@merrymercy It seems clean_sharegpt accepts a JSON file. I don't know whether https://sharegpt.com/ provides JSON or not. Do you have a process to convert the raw HTML pages into the JSON format the script expects?
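
Something along the lines of the sketch below is what I have in mind; the CSS selectors are purely hypothetical placeholders and would need to match the real ShareGPT page markup:

# Hypothetical converter from saved ShareGPT HTML pages to the JSON format
# consumed by fastchat.data.clean_sharegpt. The selectors are placeholders.
import json
import pathlib

from bs4 import BeautifulSoup  # pip install beautifulsoup4

def page_to_record(path: pathlib.Path) -> dict:
    soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
    conversations = []
    for turn in soup.select(".chat-message"):  # placeholder selector
        role = "human" if "from-user" in turn.get("class", []) else "gpt"
        conversations.append({"from": role, "value": str(turn)})
    return {"id": path.stem, "conversations": conversations}

records = [page_to_record(p) for p in pathlib.Path("pages").glob("*.html")]
with open("sharegpt_html.json", "w") as f:
    json.dump(records, f, indent=2)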

@MarkSchmidty

MarkSchmidty commented Apr 2, 2023

ShareGPT Dataset:

Zipped JSONs with 90,000 conversations from ShareGPT, split into two files of 45k each:
part 1: https://files.catbox.moe/bhtp9i.zip
part 2: https://files.catbox.moe/ahoivx.zip

The format should work as-is for training. Use the cleaning tool to remove HTML markup: https://github.com/lm-sys/FastChat/blob/main/docs/commands/data_cleaning.md

(Note: I'm just relaying this info from someone who sent it my way. So I don't know anything more than anyone else.)


The entire pre-cleaned 90k conversation dataset is also available here: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/tree/main/HTML_cleaned_raw_dataset

A pre-cleaned, English only, "unfiltered," and 2048 token split version of the ShareGPT dataset ready for finetuning is available here: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered
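
If you just want to pull the Hugging Face copy into a script, something like the sketch below should work (the exact file name in the repo is an assumption on my part; check the dataset page for the current one):

# Sketch: load the ShareGPT JSON from the Hugging Face dataset repo.
# The data_files URL/name is assumed; verify it against the repo listing.
from datasets import load_dataset  # pip install datasets

dataset = load_dataset(
    "json",
    data_files="https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json",
    split="train",
)
print(len(dataset), "conversations")
print(dataset[0]["conversations"][0])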

@Kreijstal

@MarkSchmidty
You are doing god's work. Good job democratizing AI.

@timatom

timatom commented Apr 2, 2023

For all you scrapers out there, there's another site with ChatGPT conversations that is rather easy to scrape:

https://chatlogs.net/

It has around 80k conversations from what I can tell.
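
If anyone tries it, a bare-bones loop might look like the sketch below; the URL pattern, id scheme, and parsing are entirely hypothetical guesses, and check the site's terms of service and robots.txt before scraping:

# Hypothetical scraping loop for illustration only; the URL pattern,
# id scheme, and parsing are guesses and will need adjusting.
import time

import requests
from bs4 import BeautifulSoup

def fetch_conversation(conv_id: int):
    # Assumption: conversations are addressable by a numeric id.
    resp = requests.get(f"https://chatlogs.net/conversations/{conv_id}", timeout=10)
    if resp.status_code != 200:
        return None
    return BeautifulSoup(resp.text, "html.parser").get_text("\n")

for conv_id in range(1, 50):
    text = fetch_conversation(conv_id)
    if text:
        print(text[:200])
    time.sleep(1)  # be polite: rate-limit requests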

@BadisG

BadisG commented Apr 2, 2023

> ShareGPT Dataset:
>
> Zipped JSONs with 90,000 conversations from ShareGPT, split into two files of 45k each: part 1: https://files.catbox.moe/bhtp9i.zip part 2: https://files.catbox.moe/ahoivx.zip
>
> The format should work as-is for training. Use the cleaning tool to remove HTML markup: https://github.com/lm-sys/FastChat/blob/main/docs/commands/data_cleaning.md
>
> (Note: I'm just relaying this info from someone who sent it my way. So I don't know anything more than anyone else.)
>
> The entire pre-cleaned 90k conversation dataset is also available here: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/tree/main/HTML_cleaned_raw_dataset
>
> A pre-cleaned, English only, "unfiltered," and 2048 token split version of the ShareGPT dataset ready for finetuning is available here: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered

Not all heroes wear capes!

@clulece

clulece commented Apr 3, 2023

@MarkSchmidty Thank you for providing the higher-quality version that has all the senseless/misguided OpenAI moralizing purged.

@DemonFemaleAlpha1

hello

@alanxmay
Contributor

alanxmay commented Apr 7, 2023

> ShareGPT Dataset:
>
> Zipped JSONs with 90,000 conversations from ShareGPT, split into two files of 45k each: part 1: https://files.catbox.moe/bhtp9i.zip part 2: https://files.catbox.moe/ahoivx.zip
>
> The format should work as-is for training. Use the cleaning tool to remove HTML markup: https://github.com/lm-sys/FastChat/blob/main/docs/commands/data_cleaning.md
>
> (Note: I'm just relaying this info from someone who sent it my way. So I don't know anything more than anyone else.)
>
> The entire pre-cleaned 90k conversation dataset is also available here: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/tree/main/HTML_cleaned_raw_dataset
>
> A pre-cleaned, English only, "unfiltered," and 2048 token split version of the ShareGPT dataset ready for finetuning is available here: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered

I fine-tuned the 13B model using the dataset from the Hugging Face link above, but the model's performance was poor; in some cases it failed to correctly output the end-of-sequence token.
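
One thing I still need to rule out is whether the end-of-sequence (EOS) token was actually appended to each training example during preprocessing, since a missing EOS often produces models that never emit a stop token. A quick sanity check might look like this (the model path and prompt format are placeholders):

# Sketch: check that the tokenizer has an EOS token and that a formatted
# training example actually ends with it.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/base-13b-model")
print("eos:", tokenizer.eos_token, "id:", tokenizer.eos_token_id)

example = "USER: hi ASSISTANT: hello" + tokenizer.eos_token
ids = tokenizer(example).input_ids
assert ids[-1] == tokenizer.eos_token_id, "EOS missing from the end of the example"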

@BadisG

BadisG commented Apr 7, 2023

@alanxmay You fine-tuned it with the unfiltered dataset?

@ethanyanjiali
Contributor

> "Open-Source"
>
>   • No Weights
>   • No Dataset
>   • No checkpoints
>
> That's not open source. Not at all, don't claim to be open.

Don't take everything for granted. Given how closed OpenAI is, I really respect Meta for releasing LLaMA, as well as all the research groups that released follow-up work on LLaMA.

@BadisG

BadisG commented Apr 7, 2023

> Don't take everything for granted. Given how closed OpenAI is, [...]

ClosedAI; they really parted ways with all the nice principles they once had.

merrymercy added the "question" label (Further information is requested) on Apr 8, 2023
@zhisbug
Collaborator

zhisbug commented Apr 8, 2023

Closing this issue for now.

So far, we have released:

  • weights, 7B and 13B (and checkpoints)
  • training recipes
  • data processing scripts
  • various ways to run the bot on diverse hardware

We're unable to release the data due to various factors out of our control.

We'll keep pushing the limits to give the community better and more open LLMs!

> "Open-Source"
>
>   • No Weights
>   • No Dataset
>   • No checkpoints
>
> That's not open source. Not at all, don't claim to be open.

zhisbug closed this as completed on Apr 8, 2023
@alanxmay
Contributor

alanxmay commented Apr 10, 2023

> @alanxmay You fine-tuned it with the unfiltered dataset?

@BadisG Yes, I am using this one: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_unfiltered_cleaned_split.json

@BadisG

BadisG commented Apr 10, 2023

@alanxmay How did it go? Did you manage to make it better?

@eeric

eeric commented Apr 15, 2023

@MarkSchmidty
How do you generate sg_90k_part1_clear.json?
In addition, how do you crawl data from sharegpt.com? Do you have a crawling script?

@MarkSchmidty

MarkSchmidty commented Apr 15, 2023

I didn't generate these. I was sent them by an anonymous source.

It's not possible to crawl ShareGPT anymore. ShareGPT used to have a page you could crawl, but it no longer does.

@eeric

eeric commented Apr 15, 2023

ok, that's sad news.

@timatom

timatom commented Apr 16, 2023

> ok, that's sad news.

Theoretically, you could scrape Twitter. Anything someone shares publicly on social media is technically fair game to scrape.

@abhinavchoudhry

> ok, that's sad news.
>
> Theoretically, you could scrape Twitter. Anything someone shares publicly on social media is technically fair game to scrape.

Yeah, but they are charging exorbitant fees for scraping now. Twitter is as good as closed now, at least for ordinary developers and researchers. Academic access RIP.

@timatom

timatom commented Jun 5, 2023

> ok, that's sad news.
>
> Theoretically, you could scrape Twitter. Anything someone shares publicly on social media is technically fair game to scrape.
>
> Yeah, but they are charging exorbitant fees for scraping now. Twitter is as good as closed now, at least for ordinary developers and researchers. Academic access RIP.

Yeah, it's another case of sad news. Not sure how this will all play out long-term. Best of luck to everyone.

@kkkparty

> @alanxmay You fine-tuned it with the unfiltered dataset?
>
> @BadisG Yes, I am using this one: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_unfiltered_cleaned_split.json

I used this dataset with the Baichuan 7B model, with the following command:

CUDA_VISIBLE_DEVICES="7" torchrun --nproc_per_node=1 --master_port=20001 fastchat/train/train_baichuan.py \
    --model_name_or_path /workspace/baichuan/model_para/Baichuan-7B \
    --data_path /workspace/baichuan/dataset/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json \
    --bf16 False --output_dir output_baichuan --num_train_epochs 3 \
    --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" --save_strategy "steps" --save_steps 1200 --save_total_limit 10 \
    --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type "cosine" \
    --logging_steps 1 --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' --tf32 False \
    --model_max_length 64 --gradient_checkpointing True --lazy_preprocess True

But it crashed with dataset problems. Is there some procedure I should follow when using the ShareGPT dataset with Baichuan 7B weights?
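
As a first debugging step, I plan to verify that every record in the JSON matches the conversation schema ("conversations" entries with "from"/"value" keys, which is my assumption based on the files discussed above), since a single malformed record can break preprocessing:

# Sketch: sanity-check the ShareGPT JSON before training. The expected
# schema is assumed from the files discussed in this thread.
import json

with open("ShareGPT_V3_unfiltered_cleaned_split.json") as f:
    data = json.load(f)

bad = []
for i, record in enumerate(data):
    convs = record.get("conversations", [])
    if not convs or any("from" not in m or "value" not in m for m in convs):
        bad.append(i)

print(f"{len(data)} records, {len(bad)} malformed, first few: {bad[:5]}")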
