How to process 2 pages in different threads? #623
Comments
Playwright isn't thread-safe, so you need to start Playwright separately in each thread. You can use:
from playwright.sync_api import sync_playwright
from threading import Thread

def run1():
    with sync_playwright() as playwright:
        browser = playwright.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto("https://google.com")
        page.wait_for_timeout(1000)
        page.close()

def run2():
    with sync_playwright() as playwright:
        browser = playwright.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto("https://google.com")
        page.wait_for_timeout(1000)
        page.close()

def main():
    t = Thread(target=run1)
    t1 = Thread(target=run2)
    t.start()
    t1.start()
    t.join()
    t1.join()

if __name__ == "__main__":
    main()
Closed as part of the triage and no response. |
Do I understand this one correctly, that if I create a new playwright object inside a newly opened thread it will be stable? |
exactly, this should work. |
Hmm, okay that confuses me because according to #470 playwright is not threadsafe, am I misinterpreting this issue? |
Ah, maybe I misinterpreted it. If you create a new Playwright instance in each thread, then it works fine. Sharing one instance between different threads does not work or is not stable. |
As I understand your response, the reply we got in #470 was incorrect, because as far as I understand the following snippet, "create a new playwright instance in each thread" is exactly what we do:
from playwright.sync_api import sync_playwright
from concurrent.futures.thread import ThreadPoolExecutor

def sample_func():
    playwright_instance = sync_playwright().start()
    # Keep the browser type in a variable so its name can be used in the file name below.
    browser_type = playwright_instance.firefox
    browser = browser_type.launch()
    page = browser.new_page()
    page.goto('http://whatsmyuseragent.org/')
    page.screenshot(path=f'example-{browser_type.name}.png')
    browser.close()

tpe = ThreadPoolExecutor()
for _ in range(100):
    tpe.submit(sample_func)
tpe.shutdown()
Is my assumption correct that, contrary to the response in #470 (that Playwright is not thread-safe), it is indeed thread-safe if a new Playwright instance is created inside each thread, as is done in the code snippet above? |
Yes this should work but I would not recommend spawning as many Playwright instances and do the scaling at that end. e.g. over multiple instances running on k8 and syncing over messaging queues like RabbitMQ would perform and scale better. |
Thank you for the quick responses. I don't fully understand what you mean by "do the scaling at that end, over multiple instances running on k8". I am assuming you mean not to use threading at all, but to write the logic so that one execution runs one and only one instance, and then use Kubernetes to run it multiple times, like in the threading example? |
FYI @chwba: not being thread-safe means that it is not safe to call methods or access attributes from a different thread. You can always create a Playwright instance in TLS and it would be safe. |
Understood. The reason my colleague opened #470 was that, using Playwright as described in the code snippet(s), we were experiencing heavily "weird" behaviour: web elements that did exist not being detected, and inexplicable exceptions that did not make much sense at first sight. None of this happened when running only one instance. Now I am looking for the best possible way to run multiple instances of Playwright (roughly 100 simultaneously running instances at most, at the moment) while still keeping Playwright stable. After @mxschmitt's response I have the feeling threading would not be the right way to go, and multiprocessing will eat up even more resources and expose additional challenges, especially if we make it async. |
For using Playwright in a multithreaded environment, it is recommended to use TLS; that keeps it safe and it will not cause weird exceptions. You can avoid multithreading if you choose to use ProcessPoolExecutor and then use TLS per process to isolate it, which is safe as well. |
What do you mean by "use TLS"? I interpret "use TLS" as connecting to the target website via "https" instead of "http". Is this correct, or am I misunderstanding what you mean? Maybe I should clarify: the application will be running on a single dedicated server (32 cores, 256 GB RAM), and we are not an organization but rather freelancers who create automated testing solutions for the websites we create. |
Here by TLS I meant thread-local storage, which is like the contextvars Python module but for threads rather than asyncio. It is the recommended way to store connection objects etc. in multithreaded environments, and it can also be used for Playwright. |
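To make the thread-local idea concrete, here is a minimal standard-library sketch. It is not Playwright code: the names `local_data`, `worker` and `seen` are made up for illustration, and a plain integer stands in for whatever per-thread object (for example a Playwright instance) you would store.

```python
import threading

# One thread-local namespace: the same name everywhere, separate contents per thread.
local_data = threading.local()
seen = {}  # thread name -> value that thread read back
all_written = threading.Barrier(3)  # makes every thread write before any thread reads

def worker(value):
    # This attribute is only visible from the thread that set it.
    local_data.resource = value
    all_written.wait()
    seen[threading.current_thread().name] = local_data.resource

threads = [
    threading.Thread(target=worker, args=(i,), name=f"worker-{i}")
    for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The calling thread never set `resource`, so it sees none of the workers' values.
main_has_resource = hasattr(local_data, "resource")
print(seen, main_has_resource)
```

Even though all three threads write to `local_data.resource` before any of them reads it, each one reads back its own value.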
I have to admit that I haven't worked with contextvars before. Would it be possible to create a small snippet for us that shows how you would suggest going forward, keeping in mind the ThreadPoolExecutor snippet I posted above? EDIT: I did some digging regarding TLS and found https://stackoverflow.com/questions/1408171/thread-local-storage-in-python - do you mean that we should instantiate the Playwright instances inside the threads in such a thread-local namespace? |
Okay, so this is what we have come up with. We hope we have understood your suggestions correctly:
import random
import threading
from concurrent.futures.thread import ThreadPoolExecutor
from time import sleep

from loguru import logger
from playwright.sync_api import Playwright, Browser, BrowserContext, Page
from playwright.sync_api import sync_playwright

class Tls(threading.local):
    def __init__(self):
        # Everything starts as None; run() fills these in inside the worker thread.
        self.playwright: Playwright = None
        self.browser: Browser = None
        self.context: BrowserContext = None
        self.page: Page = None

class Generator:
    tls = Tls()

    def __init__(self):
        pass

    def run(self, k):
        logger.info("THREAD: %s - ENTER" % k)
        self.tls.playwright = sync_playwright().start()
        self.tls.browser = self.tls.playwright.firefox.launch(headless=True)
        self.tls.context = self.tls.browser.new_context(
            bypass_csp=True,
            ignore_https_errors=True,
            color_scheme=random.choice(["dark", "light", "no-preference"]),
            timezone_id=None,
            geolocation={"longitude": 1, "latitude": 2},
            locale="en-US",
            java_script_enabled=True,
            user_agent=None,
        )
        self.tls.page = self.tls.context.new_page()
        self.tls.page.goto("https://google.com")
        self.tls.page.screenshot(path=f'{random.randint(100, 10000)}.png')
        self.tls.page.close()
        self.tls.context.close()
        self.tls.browser.close()
        self.tls.playwright.stop()
        logger.info("THREAD: %s - EXIT" % k)

if __name__ == "__main__":
    generators = list()
    tpe = ThreadPoolExecutor()
    for i in range(1, 11):
        generator = Generator()
        generators.append(generator)
        tpe.submit(generator.run, i)
        sleep(0.1)
    tpe.shutdown(wait=False)
    while sum([int(t.is_alive()) for t in tpe._threads]) > 1:
        sleep(3)
Going on from this, I would have two more questions regarding stability:
I'm not sure how Playwright works internally, but I could imagine that beyond a certain number of pages/contexts inside one browser, each of which ultimately results in a new browser tab, it can become unstable if we open too many inside one browser instance. |
Here is a snippet which uses TLS, i.e. thread-local storage, and other best practices for a multithreading-safe Playwright script:
import threading
from playwright.sync_api import sync_playwright
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

class Tls(threading.local):
    def __init__(self) -> None:
        self.playwright = sync_playwright().start()
        print("Create playwright instance in Thread", threading.current_thread().name)

class Worker:
    tls = Tls()

    def run(self):
        print("Launched worker in ", threading.current_thread().name)
        browser = self.tls.playwright.chromium.launch(headless=False)
        context = browser.new_context()
        # Open the page in the context created above so that context.close() cleans it up.
        page = context.new_page()
        page.goto("http://whatsmyuseragent.org/")
        page.screenshot(path=f"example-{threading.current_thread().name}.png")
        page.close()
        context.close()
        browser.close()
        print("Stopped worker in ", threading.current_thread().name)

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=5) as executor:
        for _ in range(50):
            worker = Worker()
            executor.submit(worker.run)
Here each thread creates its own Playwright object the first time it runs a worker, and later workers on the same thread reuse it. Because the object is stored in TLS, it is thread-safe and wouldn't cause errors or race conditions, you can change Here is the gist URL: https://gist.github.com/kumaraditya303/e6dee949dda298b35d167369955d45c6 |
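The "create once per thread, then reuse" behaviour of a `threading.local` subclass can be checked without launching any browser. The following is only a sketch using the standard library: `FakeDriver`, `init_threads` and `task` are hypothetical names, and `FakeDriver` stands in for the expensive per-thread object.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

init_threads = []  # names of the threads in which Tls.__init__ ran
init_lock = threading.Lock()

class FakeDriver:
    """Hypothetical stand-in for an expensive per-thread object."""
    def __init__(self, owner):
        self.owner = owner

class Tls(threading.local):
    def __init__(self):
        # Runs in the creating thread, and again in every other thread
        # the first time that thread touches the object.
        name = threading.current_thread().name
        with init_lock:
            init_threads.append(name)
        self.driver = FakeDriver(name)

creator = threading.current_thread().name
tls = Tls()  # this already runs __init__ in the creating thread

def task(_):
    # Every task sees the driver that belongs to the thread it runs on.
    return tls.driver.owner == threading.current_thread().name

with ThreadPoolExecutor(max_workers=3, thread_name_prefix="pw") as pool:
    results = list(pool.map(task, range(12)))

print(len(results), sorted(init_threads))
```

Twelve tasks run on at most three pool threads, so the constructor runs at most four times: once per pool thread plus once in the thread that instantiated `Tls`. That last call is worth keeping in mind if the constructor starts something heavy.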
Yes, you should create a new context per thread.
If you launch one Playwright instance, it creates two subprocesses: the driver and the browser you want to use. Hence, I would say you can create multiple contexts, around 3 per thread.
Do not try to create multiple Playwright instances per thread; always use thread-local storage to isolate them from each other. |
I missed that: if you have 32 cores, then you should use at least 3~4 contexts per thread and have 64 threads. That would be a good start. |
Thank you for the thorough replies. One last question: if we go ahead and use 3-4 contexts per thread, with each context working on different tests, I haven't come up with any approach other than opening another 3-4 threads from inside this very thread (which wouldn't work, because these new threads would not have access to the thread-local Playwright anymore). See the code snippet below to clarify:
import random
import threading
from concurrent.futures.thread import ThreadPoolExecutor
from time import sleep

from loguru import logger
from playwright.sync_api import Playwright, Browser, BrowserContext, Page
from playwright.sync_api import sync_playwright

class Tls(threading.local):
    def __init__(self):
        self.playwright: Playwright = None
        self.browser: Browser = None
        self.context: BrowserContext = None
        self.page: Page = None

class Generator:
    tls = Tls()

    def __init__(self):
        pass

    def run(self, k):
        logger.info("THREAD: %s - ENTER" % k)
        self.tls.playwright = sync_playwright().start()
        self.tls.browser = self.tls.playwright.firefox.launch(headless=True)
        # Create 3 different contexts
        self.tls.context = self.tls.browser.new_context()
        self.tls.second_context = self.tls.browser.new_context()
        self.tls.third_context = self.tls.browser.new_context()
        # Create 3 different pages, one in each context
        self.tls.page = self.tls.context.new_page()
        self.tls.second_page = self.tls.second_context.new_page()
        self.tls.third_page = self.tls.third_context.new_page()
        # Navigate to 3 different websites
        self.tls.page.goto("https://google.com")
        self.tls.second_page.goto("https://web.de")
        self.tls.third_page.goto("https://cnn.com")
        # do separate work on each of the pages
        # how can we achieve concurrency in this scenario?
        self.tls.page.close()
        self.tls.second_page.close()
        self.tls.third_page.close()
        self.tls.context.close()
        self.tls.second_context.close()
        self.tls.third_context.close()
        self.tls.browser.close()
        self.tls.playwright.stop()
        logger.info("THREAD: %s - EXIT" % k)

if __name__ == "__main__":
    generators = list()
    tpe = ThreadPoolExecutor()
    for i in range(1, 11):
        generator = Generator()
        generators.append(generator)
        tpe.submit(generator.run, i)
        sleep(0.1)
    tpe.shutdown(wait=False)
    while sum([int(t.is_alive()) for t in tpe._threads]) > 1:
        sleep(3)
@chwba First, I would recommend initialising Playwright in the thread-local constructor so that it can be reused, as in the example I gave earlier. By 3~4 contexts I meant that you would process them synchronously. Since, as you state above, you need concurrency there too, you should create one context per thread and parallelize with threads. Then try increasing the thread count from 64 upward for as long as your server handles it correctly, since 64 threads means 64 subprocesses and you may run out of memory. |
@kumaraditya303 Thank you again for the thorough replies. I wanted to let you know that we have now finished creating the solution. We went for 1 playwright/context/page per thread and were able to achieve 45-50 simultaneous threads running headless Firefox browsers with a CPU load of ~85% on average with peaks to 100%. RAM is no issue at all; only 30% is used at full load. If we increase to 55 or more, the CPU load stays at 100% though, and it does not feel safe to do that over a long period x). Do you have any hints regarding how we could tweak Playwright a little further to maybe squeeze out another couple of threads? PS: Regarding the thread-local constructor, we had left it at None because we thought that otherwise it would also start a Playwright process in the main process, and we only run Playwright inside the threads; outside we only print statistics and manage everything else. |
@chwba The CPU load is fair, as there would be around 50 browser processes running simultaneously; I don't think you could run more threads at this time. However, if you want even more performance out of it, you can combine threading and asyncio, which would give you more throughput at the same CPU load, but then you would use multiple contexts and hence more RAM. |
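As a rough illustration of what "combine threading and asyncio" could look like structurally, here is a sketch that uses only the standard library. It does not call Playwright at all: `handle_page` is a hypothetical stand-in in which a real script would await Playwright's async API, and the counts (4 threads, 3 coroutines each) are arbitrary assumptions.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor

async def handle_page(thread_name, index):
    # Hypothetical stand-in for driving one page/context; the sleep only
    # simulates waiting on I/O.
    await asyncio.sleep(0.01)
    return (thread_name, index)

async def run_contexts(n_contexts):
    name = threading.current_thread().name
    # All coroutines started here share this thread's event loop and interleave on it.
    return await asyncio.gather(*(handle_page(name, i) for i in range(n_contexts)))

def thread_main(n_contexts):
    # asyncio.run creates a fresh event loop that is used by this thread only.
    return asyncio.run(run_contexts(n_contexts))

with ThreadPoolExecutor(max_workers=4) as pool:
    per_thread = list(pool.map(thread_main, [3] * 4))

total = sum(len(batch) for batch in per_thread)
print(total)
```

Each thread owns one event loop, and nothing created inside a loop is handed to another thread, which matches the earlier advice not to share objects across threads.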
Can you show how to do it correctly with asyncio? And does Firefox have any arguments like https://stackoverflow.com/a/58589026 to reduce the load on the processors? (I have many contexts running: 1 browser per process and many contexts per thread.) First I start the process -> start the asyncio loop in the process -> start the browser -> create about 5 threads and pass them the browser -> start about 5 contexts per thread. This loads the server heavily (6 cores, 32 RAM): the cores are loaded at 100%, and then the browser is simply closed with "Target page, context or browser has been closed" (( |
Same problem |
@Konano create a new issue for asyncio |
@kumaraditya303 you mean this problem is related to asyncio, not to playwright? |
No, create a new issue on the playwright-python repo to separate that discussion from this one.
I want to open 2 pages and create 2 threads: thread 1 processes page 1, and thread 2 processes page 2.
I tried this code,
but it first opens page 1 and, 5 seconds later, opens page 2.
It opens the pages one by one.
How can I process multiple pages in different threads at the same time?
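The code from the question is not shown, so the cause of the one-by-one behaviour is unknown. Independently of Playwright, a small standard-library sketch like the one below can confirm that two workers really overlap. `process_page` is a hypothetical stand-in for the per-page work, and the 5-second timeout is an arbitrary choice.

```python
import threading

# Both workers must reach the barrier within 5 seconds of each other.
barrier = threading.Barrier(2, timeout=5)
finished = []

def process_page(label):
    # The wait only succeeds if the other worker is running at the same time;
    # otherwise it raises threading.BrokenBarrierError after the timeout.
    barrier.wait()
    finished.append(label)

t1 = threading.Thread(target=process_page, args=("page1",))
t2 = threading.Thread(target=process_page, args=("page2",))

# Start both threads before joining either of them.
t1.start()
t2.start()
t1.join()
t2.join()

print(sorted(finished))
```

If `t1.join()` were called before `t2.start()`, the first worker would wait alone at the barrier and time out, which is the sequential pattern described in the question.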