Diarization Output Modified #1586

happyhuman · 2018-07-19T22:09:36Z

Printing the last paragraph only.

happyhuman · 2018-07-19T22:25:00Z

@dizcology, @tswast, @sirtorry: I had to adjust the output in this sample a little. Can you please review this PR? Thanks.

dizcology · 2018-07-20T16:35:55Z

speech/cloud-client/beta_snippets.py

+    words_info = result.alternatives[0].words
+    pieces = ['%s (%s)' % (word_info.word, word_info.speaker_tag)
+              for word_info in words_info]
+    print(' '.join(pieces))


I feel this might not illustrate the "typical" use case, where the developer might more likely want to group and join the words according to their speaker_tag.

Interesting point. Tough to say what the right use case is. But I see it just as a sample to show them the API, not the use case. Do you think we can keep it as is, or should we change it?

I agree - in that sense perhaps let's just iterate through words_info and print everything without the nice formatting of '%s (%s)'. The expected output of the test really confused me.

I think I agree with the way it's formatted but whatever you decide consider this syntax

pieces = ['{} ({})'.format(word_info.word, word_info.speaker_tag) for word_info in words_info]

I made some changes to make the output to look like this:

Speaker #1: I'm here
Speaker #2: hi I'd like to buy a Chrome Cast and I was wondering whether you could help me
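
For readers who want to reproduce output in this shape, here is a minimal sketch of one way to group consecutive words by speaker. It is not the code that was merged. `WordInfo` and `format_turns` are hypothetical names, and the word list is a shortened, hand-written stand-in for what the API would return.

```python
import collections
import itertools

# Hypothetical stand-in for the API's per-word objects; only the two
# attributes used in this thread (word, speaker_tag) are modeled.
WordInfo = collections.namedtuple('WordInfo', ['word', 'speaker_tag'])


def format_turns(words_info):
    """Return one 'Speaker #N: ...' line per run of consecutive same-tag words."""
    lines = []
    # groupby only merges adjacent items, so a speaker who talks again later
    # gets a new line instead of being folded into their first turn.
    for tag, run in itertools.groupby(words_info, key=lambda w: w.speaker_tag):
        lines.append('Speaker #{}: {}'.format(tag, ' '.join(w.word for w in run)))
    return lines


words_info = [
    WordInfo("I'm", 1), WordInfo('here', 1),
    WordInfo('hi', 2), WordInfo("I'd", 2), WordInfo('like', 2),
]
for line in format_turns(words_info):
    print(line)
```

Because the grouping follows the order of the words, the result reads as a conversation rather than as a per-speaker summary.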

jerjou · 2018-07-20T18:13:07Z

speech/cloud-client/beta_snippets.py

-              .format(i, alternative.transcript))
-        print('Speaker Tag for the first word: {}'
-              .format(alternative.words[0].speaker_tag))
+    result = response.results[-1]


A comment here explaining why you're only taking the last result (instead of all of them) would probably be helpful.

Good idea. Adding it.

jerjou · 2018-07-20T18:13:11Z

speech/cloud-client/beta_snippets.py

@@ -46,7 +46,6 @@ def transcribe_file_with_enhanced_model(speech_file):
    audio = speech.types.RecognitionAudio(content=content)
    config = speech.types.RecognitionConfig(
        encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
-        sample_rate_hertz=8000,


Any particular reason for omitting this? For WAV files, the API can infer this, but in the general case it's probably a good idea to include the sample rate

I keep going back and forth about it. Sometimes it is useful, and other times it is causing an error when the input file has a different sample rate.

jerjou · 2018-07-20T18:15:10Z

speech/cloud-client/beta_snippets.py

-              .format(alternative.words[0].speaker_tag))
+    result = response.results[-1]
+    words_info = result.alternatives[0].words
+    pieces = ['%s (%s)' % (word_info.word, word_info.speaker_tag)


Isn't '{} ({})'.format(word_info.word, word_info.speaker_tag) the preferred way to do this these days? Or is Thea no longer benevolent overlord?

Good point. I will modify it.

puneith · 2018-07-20T18:14:10Z

speech/cloud-client/beta_snippets.py

@@ -46,7 +46,6 @@ def transcribe_file_with_enhanced_model(speech_file):
    audio = speech.types.RecognitionAudio(content=content)
    config = speech.types.RecognitionConfig(
        encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
-        sample_rate_hertz=8000,


Why is sample_rate removed? I am sure there is good reason.

We just had a discussion about it with Roy and Jerjou. If the input file has a different sample rate, it will cause an error. It is simpler just to omit it and the API figures it out on its own.
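
If an explicit sample rate is ever wanted again, one option is to read it from the file instead of hardcoding it, which avoids the mismatch error described above. The sketch below assumes WAV input and uses only the standard library `wave` module; `wav_sample_rate` is a hypothetical helper, not part of the sample.

```python
import os
import tempfile
import wave


def wav_sample_rate(path):
    """Return the sample rate (Hz) stored in a WAV file's header."""
    with wave.open(path, 'rb') as wav_file:
        return wav_file.getframerate()


# Build a tiny 8 kHz mono file so the sketch runs without external data.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.wav')
with wave.open(demo_path, 'wb') as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)      # 16-bit samples, matching LINEAR16
    wav_file.setframerate(8000)
    wav_file.writeframes(b'\x00\x00' * 800)

print(wav_sample_rate(demo_path))
```

The returned value could then be passed as `sample_rate_hertz` when building the `RecognitionConfig`, so the setting always matches the file.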

puneith · 2018-07-20T18:15:35Z

speech/cloud-client/beta_snippets_test.py

@@ -51,10 +51,10 @@ def test_transcribe_file_with_auto_punctuation(capsys):

 def test_transcribe_diarization(capsys):
    transcribe_file_with_diarization(
-        os.path.join(RESOURCES, 'Google_Gnome.wav'))
+        os.path.join(RESOURCES, 'commercial_mono.wav'))


Consider passing the file name as an argument instead of hardcoding it.

Even for the unit tests?

puneith · 2018-07-20T18:22:44Z

speech/cloud-client/beta_snippets.py

+    words_info = result.alternatives[0].words
+    pieces = ['%s (%s)' % (word_info.word, word_info.speaker_tag)
+              for word_info in words_info]
+    print(' '.join(pieces))


I think I agree with the way it's formatted but whatever you decide consider this syntax

pieces = ['{} ({})'.format(word_info.word, word_info.speaker_tag) for word_info in words_info]

puneith · 2018-07-20T18:22:55Z

speech/cloud-client/beta_snippets.py

-              .format(alternative.words[0].speaker_tag))
+    result = response.results[-1]
+    words_info = result.alternatives[0].words
+    pieces = ['%s (%s)' % (word_info.word, word_info.speaker_tag)


pieces is an odd variable name.

puneith · 2018-07-20T18:24:16Z

speech/cloud-client/beta_snippets.py

@@ -156,21 +153,18 @@ def transcribe_file_with_diarization(speech_file):

    config = speech.types.RecognitionConfig(
        encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
-        sample_rate_hertz=16000,
        language_code='en-US',
        enable_speaker_diarization=True,
        diarization_speaker_count=2)

    print('Waiting for operation to complete...')


Consider using Python's logging facility. Understandably, for this sample it might be overkill, so take it or leave it.

FWIW nearly all our other Python samples do print(). It's true that it's not always the recommended practice in production, but it's easy to understand. With logging there's always the risk that the developer has some weird config where the logs end up where maybe they don't expect.

happyhuman · 2018-07-20T20:58:04Z

@dizcology, @puneith, I made some changes based on the comments from all the reviewers. Can you please take a final look and approve the PR if it looks good?

jerjou · 2018-07-20T21:03:36Z

speech/cloud-client/beta_snippets.py

-        print('Speaker Tag for the first word: {}'
-              .format(alternative.words[0].speaker_tag))
+    # response.results contains partial results with the last item
+    # containing the entire result:


Mm... not quite. The transcript within each result is separate and sequential per result. However, the words list within an alternative (for whatever reason) includes all the words from all the results thus far. Thus, to get all the words with speaker tags, you only have to take the words list from the last result.

I see. Thanks for the clarification. Let me update the comment.

I'm not really understanding your comment, @jerjou but this sounds like something that needs to be documented on the cloud.google.com docs with a briefer explanation in the sample itself.
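
A small mock may make the behavior jerjou describes easier to see. This is only a sketch: the response is built by hand with `SimpleNamespace`, the field names follow the snippets in this thread, the data is invented, and the cumulative `words` behavior is taken from the comment above rather than verified against the API.

```python
from types import SimpleNamespace as NS

# Hand-built stand-in for a diarization response (invented data).
w = [NS(word='hello', speaker_tag=1), NS(word='hi', speaker_tag=2),
     NS(word='there', speaker_tag=2)]
response = NS(results=[
    # First result: its own transcript, and the words seen so far.
    NS(alternatives=[NS(transcript='hello', words=w[:1])]),
    # Last result: its own transcript, but ALL words seen so far.
    NS(alternatives=[NS(transcript='hi there', words=w[:3])]),
])

# Transcripts are per result, so joining them needs every result.
full_transcript = ' '.join(
    r.alternatives[0].transcript for r in response.results)

# The words list is cumulative, so the last result already has all of them.
words_info = response.results[-1].alternatives[0].words
tagged = [(wi.word, wi.speaker_tag) for wi in words_info]

print(full_transcript)
print(tagged)
```

Under that assumption, `results[-1]` is enough for the speaker tags, while the transcript still has to be assembled from every result.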

jerjou · 2018-07-20T21:16:34Z

speech/cloud-client/beta_snippets.py

+        if speakers_words and speakers_words[-1][0] == word_info.speaker_tag:
+            speakers_words[-1][1].append(word_info.word)
+        else:
+            speakers_words.append((word_info.speaker_tag, [word_info.word, ]))


This is a bit hard to read. An intermediate variable and a namedtuple would go a long way to making this clearer:

    Speaker = collections.namedtuple('Speaker', ['tag', 'words'])
    speaker_words = [Speaker(tag=0, words=[])]
    for word_info in words_info:
        current_speaker = speaker_words[-1]
        if current_speaker.tag == word_info.speaker_tag:
            current_speaker.words.append(word_info.word)
        else:
            speaker_words.append(Speaker(tag=word_info.speaker_tag, words=[word_info.word]))

Also, maybe speaker_sequence or something, to make it clear it's not just a speaker->words_they_spoke mapping, and is actually the conversation / words spoken in sequence.

Interesting idea. However, I think the code readability is mainly the result of the intermediate variable Speaker(tag=0, words=[]) as the first element in the list, allowing us to define current_speaker every time.
The downside of it is that we are introducing a new object that is not returned by the API and we will have to handle it in the next step separately (either by removing it, or by skipping it), which reduces the code cleanness in another way.
So, while I am not against the suggested solution, I am also not sure if it is really helping with the code readability that much.
What do you think @jerjou ?

What about this:

    speakers = []
    words = []
    for word_info in words_info:
        if (not speakers) or speakers[-1] != word_info.speaker_tag:
            speakers.append(word_info.speaker_tag)
            words.append([])
        words[-1].append(word_info.word)

I think this is more readable than the current code, without introducing the intermediate variable.

jerjou · 2018-07-20T21:19:56Z

speech/cloud-client/beta_snippets.py

    elif args.command == 'multi-language':
-        transcribe_file_with_multilanguage(args.path, args.first, args.second)


For future reference, argparse's sub-commands feature would be helpful to avoid having args that only matter for one command or another.

Sounds good.
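
As a rough illustration of the sub-commands suggestion, here is a sketch using `argparse.ArgumentParser.add_subparsers`. The `multi-language` command and its `path`, `first`, and `second` arguments come from the diff above; the `diarization` command name and `build_parser` are assumptions made for the example.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description='Beta snippets (sketch).')
    subparsers = parser.add_subparsers(dest='command')

    # Assumed command name; it takes only the audio path.
    diarization = subparsers.add_parser('diarization')
    diarization.add_argument('path')

    # Only this sub-command knows about the two language arguments.
    multi = subparsers.add_parser('multi-language')
    multi.add_argument('path')
    multi.add_argument('first')
    multi.add_argument('second')
    return parser


args = build_parser().parse_args(
    ['multi-language', 'audio.wav', 'en-US', 'es'])
print(args.command, args.path, args.first, args.second)
```

With this layout, arguments such as `first` and `second` simply do not exist on the namespace for other commands, so there is nothing to ignore or validate by hand.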

puneith · 2018-07-20T21:21:30Z

speech/cloud-client/beta_snippets.py

+    # Separating the words by who said what:
+    speakers_words = []
+    for word_info in words_info:
+        if speakers_words and speakers_words[-1][0] == word_info.speaker_tag:


If I understand correctly what this for loop is doing (creating a list of words per tag), isn't it better to use a hashmap? Does this loop do what it's supposed to do?

Jerjou is already reviewing this code so I should stop reviewing. You are already in great hands :)

tswast · 2018-07-20T22:40:11Z

speech/cloud-client/beta_snippets.py

@@ -156,21 +153,18 @@ def transcribe_file_with_diarization(speech_file):

    config = speech.types.RecognitionConfig(
        encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
-        sample_rate_hertz=16000,
        language_code='en-US',
        enable_speaker_diarization=True,
        diarization_speaker_count=2)

    print('Waiting for operation to complete...')


FWIW nearly all our other Python samples do print(). It's true that it's not always the recommended practice in production, but it's easy to understand. With logging there's always the risk that the developer has some weird config where the logs end up where maybe they don't expect.

tswast · 2018-07-20T22:42:32Z

speech/cloud-client/beta_snippets.py

-        print('Speaker Tag for the first word: {}'
-              .format(alternative.words[0].speaker_tag))
+    # The transcript within each result is separate and sequential per result.
+    # However, the words list within an alternative (for whatever reason)


I understand that this comment is to explain why the [-1], but I'm having trouble understanding what this is referring to. (Probably because I don't know what an "alternative" means in this context.)

I think you could probably get by with fewer words within the sample with just

# To get all words with speaker tags, you only have to take the words list from the last result.

and leave the extra explanation about alternatives for the actual cloud.google.com docs.

I initially had a shorter comment. But Jerjou suggested to explain it more, so I added the longer comment. I am okay either way, but I also think this is just a sample and not the formal documentation and the sample does not need to go over all the details of what's what.

tswast · 2018-07-20T22:49:01Z

speech/cloud-client/beta_snippets.py

+    speakers = []
+    words = []
+    for word_info in words_info:
+        if (not speakers) or speakers[-1] != word_info.speaker_tag:


I think this implementation means that if speakers appear in certain orders you could end up with duplicates. For example:

word_info speakers: A, B, B, A would result in a speakers list: A, B, A.

I think this sample would be clearer with a single dictionary of speakers to words rather than two lists.
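
The trade-off between the two shapes can be sketched with the A, B, B, A ordering from the example. The words are invented and neither loop is the merged code; the point is only that a dictionary collapses each speaker into one entry, while a sequence of turns keeps the conversation order.

```python
import collections

# Invented (word, speaker_tag) pairs in the order A, B, B, A.
tagged_words = [('hello', 'A'), ('hi', 'B'), ('there', 'B'), ('thanks', 'A')]

# Dictionary: one entry per speaker, conversation order is lost.
by_speaker = collections.defaultdict(list)
for word, tag in tagged_words:
    by_speaker[tag].append(word)

# Sequence of turns: a speaker can appear more than once.
turns = []
for word, tag in tagged_words:
    if not turns or turns[-1][0] != tag:
        turns.append((tag, []))
    turns[-1][1].append(word)

print(dict(by_speaker))
print(turns)
```

Which one is "right" depends on whether the reader wants everything a speaker said or a transcript of who spoke when.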

We had a long discussion about the code readability of this part of the sample. The API returns each word with an associated speaker_tag. So, if I am speaker #1 and I say "I am ok", the words 'I', 'am', and 'ok' will all be tagged with speaker_tag: 1.

Since the point of the sample is mostly just to showcase the API call and how to unbox the response object, I am beginning to think that this particular implementation is adding more confusion than clarity. I am leaning back toward a simpler implementation that just iterates over words and prints each word along with the associated speaker_tag. What do you think?

Yes, printing each word with a speaker tag would be much clearer.

You could do something like

    current_speaker = None
    for word_info in words_info:
        if current_speaker is None or current_speaker != word_info.speaker_tag:
            current_speaker = word_info.speaker_tag
            print()
            print(current_speaker)
        print(word_info.word)

if you wanted to keep them grouped.

tswast

The new loop for speaker tags is much easier to understand, thanks.

Talked to Puneith offline about it. There is no issue to be resolved.

jerjou · 2018-07-31T19:16:07Z

Shoot. I totally dropped off this PR, didn't I. Sorry - bad email management -_-; Glad y'all just went on without me.

…-samples#1586)

* Printing the last paragraph only.
* Python3 print
* Removing sample rate setting
* Adding the missing output parameter in the example
* Changes based on the comments
* Removed filenames as input parameters
* Removed unused args
* Updated README file
* Updated the inline comment
* Modified code to make it more readable
* Simplified the response object processing.
* Fixing the long line issue.

Printing the last paragraph only.

b5d4ceb

googlebot added the "cla: yes" label Jul 19, 2018

Python3 print

a4c2ca4

happyhuman requested review from tswast, dizcology and sirtorry July 19, 2018 22:24

happyhuman added 2 commits July 20, 2018 08:57

Removing sample rate setting

f7e4131

Merge branch 'master' into diarization-fix

5949a81

dizcology reviewed Jul 20, 2018


Adding the missing output parameter in the example

f1662fe

jerjou reviewed Jul 20, 2018


puneith previously requested changes Jul 20, 2018


happyhuman added 4 commits July 20, 2018 12:01

Changes based on the comments

4fbefa3

Removed filenames as input parameters

b105e2a

Removed unused args

b53296a

Updated README file

46c1f43

jerjou reviewed Jul 20, 2018


puneith reviewed Jul 20, 2018


happyhuman added 2 commits July 20, 2018 14:33

Updated the inline comment

99ed289

Modified code to make it more readable

3ef4a0d

tswast reviewed Jul 20, 2018


Simplified the response object processing.

146a180

tswast approved these changes Jul 20, 2018


Fixing the long line issue.

597dc0a

sirtorry approved these changes Jul 20, 2018


happyhuman merged commit c310941 into master Jul 20, 2018

happyhuman deleted the diarization-fix branch July 20, 2018 23:20

abhishek2690 mentioned this pull request Aug 1, 2018

Fix indentation bug in docs #1595

Merged

msampathkumar mentioned this pull request Nov 10, 2022

migrate code from googleapis/python-texttospeech #8483

Merged


telpirion mentioned this pull request Jan 13, 2023

chore(speech): migrate code from googleapis/python-speech #8982

Merged

