Transcripts by file upload #969

brandonhundt · 2024-02-16T17:57:35Z

Per our discussion about Step 1 for transcripts, we want to get some basic support in Feeder. See the doc: https://docs.google.com/document/d/1-0bOHkWMd0LUKt-1sw5h-oUlqwAdVYmtqxPyk4UGEpE/edit#bookmark=id.9j3sg0ayyaju

Make the transcript file available in the RSS feed and the API (one file per episode): validate and "shove out" a transcript with the episode.
Add an upload field to the UI, on the media files tab. VTT and SRT are supported.

Tasks (separate issues/PRs)


kookster · 2024-03-12T12:09:45Z

Should we allow multiple transcript files? There might be an uploaded version and a machine-generated one; should we be able to have both? The transcript element in the feed allows multiple files in different formats, so we could allow the same. I wonder if we should also keep track of which app generated a transcript: did this VTT come from Apple, Descript, or AWS?

Should we also support human-readable but not terribly machine-friendly formats like txt, MS Word docs, or even PDFs?

Would we store transcripts for individual audio file segments? I think if we generate the transcript from segments, those would get generated and stored next to the audio files in S3, like we do with other derivatives. But then what does that mean for the overall transcript? Seems like we'd need an (on-demand? async?) process to make a combined transcript from them. Perhaps a field on the MediaResource could indicate whether there is a transcript for it, or we could allow a MediaResource to have a Transcript...

So what would the model/storage for this episode transcript look like?

I'm thinking something like an EpisodeImage or MediaResource, with the original_url where the file is uploaded (probably an S3 URL) and the url for the public location. Since there are different standards, include a format or mime_type to designate whether they uploaded/generated a VTT, podcast:transcript JSON, or SRT file.

I think this is its own table/model, for Transcript, and Episode has many Transcript.
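
To make the format/mime_type idea concrete, here is a rough sketch in plain Ruby (no Rails) of sniffing the format from the uploaded file's content. Everything in it is an assumption for illustration, not Feeder code: the `detect_format` helper, the `MIME_TYPES` mapping (in particular the SRT MIME type), and treating any JSON object with a `segments` array as a podcast:transcript JSON file.

```ruby
require "json"

# Hypothetical mapping from detected format to the mime_type we would store.
MIME_TYPES = {
  "vtt" => "text/vtt",
  "srt" => "application/x-subrip",
  "json" => "application/json"
}.freeze

# An SRT timing line uses a comma before the milliseconds.
SRT_TIMING = /\A\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}/

# Returns "vtt", "srt", "json", or nil when the content is not recognized.
def detect_format(text)
  body = text.delete_prefix("\uFEFF") # drop an optional byte order mark

  # WebVTT files begin with the literal signature "WEBVTT".
  return "vtt" if body.match?(/\AWEBVTT(?:[ \t\r\n]|\z)/)

  # SRT files begin with a numeric cue index followed by a timing line.
  lines = body.lstrip.lines.map(&:strip)
  return "srt" if lines[0]&.match?(/\A\d+\z/) && lines[1]&.match?(SRT_TIMING)

  # Assumption: JSON transcripts are an object with a "segments" array.
  begin
    parsed = JSON.parse(body)
    return "json" if parsed.is_a?(Hash) && parsed["segments"].is_a?(Array)
  rescue JSON::ParserError
    # not JSON, fall through to nil
  end

  nil
end
```

A nil result would be the signal to reject the upload, or to keep it as a human-readable-only format if we decide to support those.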

cavis · 2024-05-21T18:42:44Z

Implementation (these could be broken up into separate PRs):

Create a new transcripts table/model
- Similar to EpisodeImage fields: status, guid, url, original_url, mime_type, file_size, format
- To start, let's treat it as episode has-one transcript. We can always change that to has-many later.
- Make paranoid.
Create a new CopyTranscriptTask as a subclass of Task
- To start, fire off porter Inspect / Copy tasks and parse the results back into the Transcript
Create UI to upload the transcript
- On save in the controller, set the original_url/file_size (and randomize the guid) on the Transcript model
- Kick off a copy transcript task in the episode.copy_media hooks
- Add validations to the transcript that it has an acceptable format (TODO: what mime_types does porter Inspect return for these transcript files?)
Add transcript link to the RSS
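
For the RSS step, a minimal sketch of rendering the transcript link as a `<podcast:transcript>` element, in plain Ruby. The `transcript_tag` helper and its keyword arguments are hypothetical; the real feed template would presumably read the url and mime_type from the Transcript model instead.

```ruby
# Characters that must be escaped inside an XML attribute value.
XML_ESCAPES = {
  "&" => "&amp;",
  "<" => "&lt;",
  ">" => "&gt;",
  '"' => "&quot;",
  "'" => "&apos;"
}.freeze

# Hypothetical helper: builds the feed element for one transcript file.
def transcript_tag(url:, mime_type:)
  attrs = { "url" => url, "type" => mime_type }
  rendered = attrs.map do |name, value|
    %(#{name}="#{value.gsub(/[&<>"']/, XML_ESCAPES)}")
  end
  "<podcast:transcript #{rendered.join(' ')} />"
end

puts transcript_tag(url: "https://example.com/ep1/transcript.vtt", mime_type: "text/vtt")
```

Escaping matters here because the public URL may carry query parameters containing `&`.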

brandonhundt added the high label Feb 16, 2024

brandonhundt added the Epic label Feb 26, 2024

brandonhundt self-assigned this Feb 26, 2024

This was referenced Jun 18, 2024

Create a new transcripts table/model #1047

Closed

Create a new CopyTranscriptTask as a subclass of Task #1048

Closed

This was referenced Jul 30, 2024

Transcript upload UI #1060

Closed

Transcript link in RSS #1061

Closed
