
I have approximately 20,000 pieces of text to translate, each averaging around 100 characters in length. I am using the multiprocessing library to speed up my API calls. The code looks like this:

```python
import os
import multiprocessing as mp
from time import sleep

from google.cloud.translate_v2 import Client
from tqdm.notebook import tqdm

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = cred_file
translate_client = Client()

def trans(text, MAX_TRIES=5):
    res = None
    sleep_time = 1
    for i in range(MAX_TRIES):
        try:
            res = translate_client.translate(text, target_language="en", model="nmt")
            error = None
        except Exception as error:
            pass
        if res is None:
            sleep(sleep_time)  # wait before trying to fetch the data again
            sleep_time *= 2
        else:
            break
    return res["translatedText"]

src_text =  # eg. ["this is a sentence"]*20000

with mp.Pool(mp.cpu_count()) as pool:
    translated = list(tqdm(pool.imap(trans, src_text), total=len(src_text)))
```

The above code unfortunately fails around iteration 2828 +/- 5 every single time (HTTP Error 503: Service Unavailable). I was hoping that having a variable sleep time would let it restart and run as normal. The weird thing is that if I restart the loop straight away, it runs again without issue, even though < 2^4 seconds have passed since the code finished execution. So the questions are:

  1. Am I doing the try/except bit wrong?
  2. Is the multiprocessing somehow affecting the API?
  3. General thoughts?

I need the multiprocessing because otherwise I would be waiting for around 3 hours for the whole thing to finish.

Comments

  • How does it fail? Commented Jun 29, 2020 at 0:10
  • @sheepez updated error to say HTTP Error 503: Service Unavailable. Commented Jun 29, 2020 at 0:19
  • 503 tells us it's an issue on Google's end, searching around I can see others have had a similar experience to you. Out of interest, are you able to pinpoint the failure to a specific piece of text; as you mentioned it fails on a specific iteration? Commented Jul 1, 2020 at 8:58
  • Instead of doing arbitrary sleep, you could check if 503 response contains a Retry-After header with a delay or a date to retry. See developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Retry-After Commented Jul 1, 2020 at 10:31
  • Can you try with sleep_time = 4 and sleep_time *= 4? Commented Jul 2, 2020 at 14:06
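The Retry-After suggestion from the comments can be sketched as follows. This is a minimal illustration, not the asker's code; `compute_wait` is a hypothetical helper, and it only handles the delay-in-seconds form of the header (an HTTP-date value falls through to the backoff delay):

```python
import time

def compute_wait(headers, fallback):
    """Return how long to sleep before retrying a 503 response.

    Prefers the server's Retry-After header (in seconds) when present;
    otherwise falls back to the caller's current backoff delay.
    """
    retry_after = headers.get("Retry-After")
    if retry_after and retry_after.isdigit():
        return int(retry_after)
    return fallback

# Usage inside a retry loop (sketch):
# if response.status_code == 503:
#     time.sleep(compute_wait(response.headers, sleep_time))
#     sleep_time *= 2
```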

3 Answers


Some thoughts: the Google APIs I have tried before can only handle a certain number of concurrent requests, and if that limit is reached, the service returns HTTP 503 "Service Unavailable" (or HTTP 403 if the daily limit or the user rate limit is exceeded).

Try implementing retries with exponential backoff: retry the operation with an exponentially increasing wait time, up to a maximum retry count. This improves bandwidth usage and maximizes request throughput in concurrent environments.

And review the Quotas and Limits page.
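The backoff schedule described above can be sketched like this (a minimal illustration; the function names are mine, not from the question, and the jitter is an extra safeguard against many workers retrying in lockstep):

```python
import random
import time

def backoff_delays(base=1, factor=2, max_tries=5, cap=32):
    """Yield the wait time before each retry: base, base*factor, ... capped at `cap`."""
    delay = base
    for _ in range(max_tries):
        yield min(delay, cap)
        delay *= factor

def call_with_backoff(operation):
    """Retry `operation` on failure, sleeping an exponentially increasing time."""
    for delay in backoff_delays():
        try:
            return operation()
        except Exception:
            # Random jitter helps avoid retry storms when many workers back off together.
            time.sleep(delay + random.uniform(0, 1))
    raise RuntimeError("max retry count reached")
```

With the defaults, the schedule is 1, 2, 4, 8, 16 seconds before giving up.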


1 Comment

If Google's Translate API limit is 6 million characters per minute and the test sends 360,000 characters, then why would the limit be reached?

Google API is excellent at hiding the complexities of performing Google translations. Unfortunately, if you step into the Google API code, it uses standard HTTP requests. This means that when you're running 20,000-plus requests, regardless of thread pooling, there will be a huge bottleneck.

Consider creating HTTP requests yourself using aiohttp (you'll need to install it from pip) and asyncio. This will allow you to run asynchronous HTTP requests. (It means you don't need google.cloud.translate_v2, multiprocessing or tqdm.notebook.)

Simply call an async method from asyncio.run(); that method builds a list of coroutines that each perform session.get(), and asyncio.gather() collects all the results.

In the example below I'm using an API key (https://console.cloud.google.com/apis/credentials) instead of Google Application Credentials / service accounts.

Using your example with asyncio & aiohttp, it ran in 30 seconds and without any errors. (Although you might want to extend the timeout on the session.)

It's worth pointing out that Google has a limit of 6 million characters per minute. Your test is doing 360,000. Therefore you'll reach the limit if you run the test 17 times in a minute!

Also, the speed is mainly determined by your machine and not by the Google API. (I ran my tests on a PC with a 3 GHz, 8-core CPU and 16 GB of RAM.)

```python
import asyncio
import json
from collections import namedtuple
from urllib.parse import quote

import aiohttp

# Model to store results.
TranslateReponseModel = namedtuple('TranslateReponseModel',
                                   ['sourceText', 'translatedText', 'detectedSourceLanguage'])

def Logger(json_message):
    print(json.dumps(json_message))  # Note: logging json is just my personal preference.

async def DownloadString(session, url, index):
    while True:  # If client error - this will retry. You may want to limit the amount of attempts
        try:
            r = await session.get(url)
            text = await r.text()
            # Logger({"data": text, "status": r.status})
            r.raise_for_status()  # This will error if the API returns a 4xx or 5xx status.
            return text
        except aiohttp.ClientConnectionError as e:
            Logger({'Exception': f"Index {index} - connection was dropped before we finished",
                    'Details': str(e), 'Url': url})
        except aiohttp.ClientError as e:
            Logger({'Exception': f"Index {index} - something went wrong. Not a connection error, that was handled",
                    'Details': str(e), 'Url': url})

def FormatResponse(sourceText, responseText):
    jsonResponse = json.loads(responseText)
    return TranslateReponseModel(
        sourceText,
        jsonResponse["data"]["translations"][0]["translatedText"],
        jsonResponse["data"]["translations"][0]["detectedSourceLanguage"])

def TranslatorUriBuilder(targetLanguage, sourceText):
    apiKey = 'ABCDED1234'  # TODO This is a 41-character API key. You'll need to generate one (it's not part of the json certificate).
    return f"https://translation.googleapis.com/language/translate/v2?key={apiKey}&q={quote(sourceText)}&target={targetLanguage}"

async def Process(session, sourceText, lineNumber):
    translateUri = TranslatorUriBuilder('en', sourceText)  # Target language is set to en (English)
    translatedResponseText = await DownloadString(session, translateUri, lineNumber)
    response = FormatResponse(sourceText, translatedResponseText)
    return response

async def main():
    statements = ["this is another sentence"] * 20000
    Logger({'Message': f'Start running Google Translate API for {len(statements)}'})
    results = []
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*[Process(session, val, idx) for idx, val in enumerate(statements)])
    Logger({'Message': f'Results are: {", ".join(map(str, [x.translatedText for x in results]))}'})
    Logger({'Message': f'Finished running Google Translate API for {str(len(statements))} and got {str(len(results))} results'})

if __name__ == '__main__':
    asyncio.run(main())
```

Additional test

The initial test runs the same translation repeatedly, so I've created a test to check that the results are not being cached by Google. I manually copied an eBook into a text file. Then, in Python, the code opens the file, groups the text into an array of 100-character chunks, takes the first 20,000 items from the array and translates each row. Interestingly, it still took under 30 seconds.

```python
import asyncio
import json
from collections import namedtuple
from urllib.parse import quote

import aiohttp

# Model to store results.
TranslateReponseModel = namedtuple('TranslateReponseModel',
                                   ['sourceText', 'translatedText', 'detectedSourceLanguage'])

def Logger(json_message):
    print(json.dumps(json_message))  # Note: logging json is just my personal preference.

async def DownloadString(session, url, index):
    while True:  # If client error - this will retry. You may want to limit the amount of attempts
        try:
            r = await session.get(url)
            text = await r.text()
            # Logger({"data": text, "status": r.status})
            r.raise_for_status()  # This will error if the API returns a 4xx or 5xx status.
            return text
        except aiohttp.ClientConnectionError as e:
            Logger({'Exception': f"Index {index} - connection was dropped before we finished",
                    'Details': str(e), 'Url': url})
        except aiohttp.ClientError as e:
            Logger({'Exception': f"Index {index} - something went wrong. Not a connection error, that was handled",
                    'Details': str(e), 'Url': url})

def FormatResponse(sourceText, responseText):
    jsonResponse = json.loads(responseText)
    return TranslateReponseModel(
        sourceText,
        jsonResponse["data"]["translations"][0]["translatedText"],
        jsonResponse["data"]["translations"][0]["detectedSourceLanguage"])

def TranslatorUriBuilder(targetLanguage, sourceText):
    apiKey = 'ABCDED1234'  # TODO This is a 41-character API key. You'll need to generate one (it's not part of the json certificate).
    return f"https://translation.googleapis.com/language/translate/v2?key={apiKey}&q={quote(sourceText)}&target={targetLanguage}"

async def Process(session, sourceText, lineNumber):
    translateUri = TranslatorUriBuilder('en', sourceText)  # Target language is set to en (English)
    translatedResponseText = await DownloadString(session, translateUri, lineNumber)
    response = FormatResponse(sourceText, translatedResponseText)
    return response

def readEbook():
    # This is a simple test to make sure the response is not cached.
    # I grabbed a random online pdf (http://sd.blackball.lv/library/Beginning_Software_Engineering_(2015).pdf)
    # and copied the text into notepad.
    with open("C:\\Dev\\ebook.txt", "r", encoding="utf8") as f:
        return f.read()

def chunkText(text):
    chunk_size = 100
    chunks = len(text)
    chunk_array = [text[i:i+chunk_size] for i in range(0, chunks, chunk_size)]
    formatResults = [x for x in chunk_array if len(x) > 10]
    return formatResults[:20000]

async def main():
    data = readEbook()
    chunk_data = chunkText(data)
    Logger({'Message': f'Start running Google Translate API for {len(chunk_data)}'})
    results = []
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*[Process(session, val, idx) for idx, val in enumerate(chunk_data)])
    Logger({'Message': f'Results are: {", ".join(map(str, [x.translatedText for x in results]))}'})
    Logger({'Message': f'Finished running Google Translate API for {str(len(chunk_data))} and got {str(len(results))} results'})

if __name__ == '__main__':
    asyncio.run(main())
```

Finally, you can find more info about the Google Translate API HTTP request at https://cloud.google.com/translate/docs/reference/rest/v2/translate, and you can run the request through Postman.
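For reference, the v2 endpoint above takes the API key, source text and target language as query parameters; a minimal URL builder might look like this (`translate_url` is an illustrative helper, and the key below is a placeholder):

```python
from urllib.parse import quote

def translate_url(api_key, source_text, target_language):
    """Build a Google Translate v2 REST URL; note the text goes in the q parameter."""
    return ("https://translation.googleapis.com/language/translate/v2"
            f"?key={api_key}&q={quote(source_text)}&target={target_language}")

# Paste the resulting URL into Postman (or fetch it with any HTTP client)
# to inspect the raw JSON response.
```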



A 503 error implies that the issue is on Google's side, which leads me to believe you're possibly being rate limited. As Raphael mentioned, is there a Retry-After header in the response? I recommend looking at the response headers, as they'll likely tell you more specifically what's going on, and possibly give you info on how to fix it.

