How to process 2 pages in different threads? #623

Closed
czcz1024 opened this issue Apr 16, 2021 · 28 comments

@czcz1024

czcz1024 commented Apr 16, 2021

I want to open 2 pages and create 2 threads: thread 1 processes page 1 and thread 2 processes page 2.
I tried this code:

from threading import Thread

from playwright.sync_api import sync_playwright


def run1(context):
    page = context.new_page()
    page.goto('https://page1')
    page.wait_for_timeout(5000)
    page.close()

def run2(context):
    page = context.new_page()
    page.goto('https://page2')
    page.wait_for_timeout(1000)
    page.close()

def main():
    with sync_playwright() as playwright:
        browser = playwright.chromium.launch(headless=False)
        context = browser.new_context()
        t=Thread(target=run1,args=(context,))
        t1=Thread(target=run2,args=(context,))
        t.start()
        t1.start()
        t.join()
        t1.join()
        context.close()
        browser.close()

But it opens page 1 first and only opens page 2 five seconds later.
The pages are opened one after the other.
How can I process multiple pages in different threads at the same time?

@kumaraditya303
Contributor

Playwright isn't thread safe, so you need to start Playwright separately in each thread, or you can use asyncio instead. For example, with one Playwright instance per thread:

from playwright.sync_api import sync_playwright
from threading import Thread


def run1():
    with sync_playwright() as playwright:
        browser = playwright.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto("https://google.com")
        page.wait_for_timeout(1000)
        page.close()


def run2():
    with sync_playwright() as playwright:
        browser = playwright.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto("https://google.com")
        page.wait_for_timeout(1000)
        page.close()


def main():
    t = Thread(target=run1)
    t1 = Thread(target=run2)
    t.start()
    t1.start()
    t.join()
    t1.join()


if __name__ == "__main__":
    main()
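
The asyncio route mentioned in this reply is not shown in the thread. A minimal sketch of it, assuming Playwright's async API (`async_playwright`) and placeholder URLs; the helper names `visit` and `main` are made up for illustration:

```python
import asyncio


async def visit(context, url):
    """Open one page in the given context and return its title."""
    page = await context.new_page()
    try:
        await page.goto(url)
        return await page.title()
    finally:
        await page.close()


async def main(urls):
    # Imported lazily so that visit() can be exercised without a browser installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch()
        context = await browser.new_context()
        try:
            # gather() schedules every visit at once, so the pages load
            # concurrently on a single thread and a single event loop.
            return await asyncio.gather(*(visit(context, url) for url in urls))
        finally:
            await context.close()
            await browser.close()
```

It would be started with something like `asyncio.run(main(["https://example.com", "https://example.org"]))`. Everything stays on one thread here, so no Playwright object is ever shared between threads.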

@mxschmitt
Member

Closed as part of the triage and no response.

@sla-te

sla-te commented May 12, 2021

Do I understand this one correctly, that if I create a new playwright object inside a newly opened thread it will be stable?

@mxschmitt
Member

exactly, this should work.

@sla-te

sla-te commented May 12, 2021

Hmm, okay that confuses me because according to #470 playwright is not threadsafe, am I misinterpreting this issue?

@mxschmitt
Member

Ah, maybe I misinterpreted it. If you create a new Playwright instance in each thread, it works fine. Sharing one instance across different threads does not work, or at least is not stable.

@sla-te

sla-te commented May 12, 2021

I understand your response to mean that the reply we got in #470 was incorrect, because as far as I understand the following snippet, "create a new playwright instance in each thread" is exactly what we do:

from playwright.sync_api import sync_playwright
from concurrent.futures.thread import ThreadPoolExecutor

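def sample_func():
    playwright_instance = sync_playwright().start()
    # Keep a handle on the browser type; the original used browser_type without defining it
    browser_type = playwright_instance.firefox
    browser = browser_type.launch()
    page = browser.new_page()
    page.goto('http://whatsmyuseragent.org/')
    page.screenshot(path=f'example-{browser_type.name}.png')
    browser.close()
    playwright_instance.stop()  # shut down this thread's driver process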

tpe = ThreadPoolExecutor()
for _ in range(100):
    tpe.submit(sample_func)
tpe.shutdown()

Is my assumption correct that, contrary to the response in #470 saying Playwright is not thread safe, it is in fact safe to use from multiple threads as long as a new Playwright instance is created inside each thread, as in the code snippet above?

@mxschmitt
Member

Yes, this should work, but I would not recommend spawning that many Playwright instances and doing the scaling at that end. For example, multiple instances running on k8s and syncing over a message queue like RabbitMQ would perform and scale better.

@sla-te

sla-te commented May 12, 2021

Thank you for the quick responses.

I don't fully understand what you mean by "do the scaling at that end, over multiple instances running on k8". I assume you mean not using threading at all, but writing the logic so that one execution runs one and only one instance, and then using Kubernetes to run it multiple times, as in the threading example?

@kumaraditya303
Contributor

FYI @chwba: "not thread safe" means that it is not safe to call methods or access attributes of an object from a thread other than the one that created it. You can always create a Playwright instance in TLS and it will be safe.

@sla-te

sla-te commented May 12, 2021

FYI @chwba: "not thread safe" means that it is not safe to call methods or access attributes of an object from a thread other than the one that created it. You can always create a Playwright instance in TLS and it will be safe.

Understood. The reason my colleague opened #470 was that, using Playwright as described in the code snippet(s), we were seeing a lot of "weird" behaviour: web elements that did exist were not detected, and we got inexplicable exceptions that made little sense at first sight. None of this happened when running only one instance.

Now I am looking for the best possible way to run multiple instances of Playwright (roughly 100 simultaneously at most, at the moment) while keeping Playwright stable. After @mxschmitt's response I have the feeling that threading would not be the right way to go, and multiprocessing would eat up even more resources and bring additional challenges, especially if we make it async.

@kumaraditya303
Contributor

For using Playwright in a multithreaded environment, TLS is recommended: it keeps each instance on its own thread, so it is safe and will not cause weird exceptions. You can also avoid multithreading altogether by using ProcessPoolExecutor, with TLS per process to isolate the instances.

@sla-te

sla-te commented May 12, 2021

For using Playwright in a multithreaded environment, TLS is recommended: it keeps each instance on its own thread, so it is safe and will not cause weird exceptions. You can also avoid multithreading altogether by using ProcessPoolExecutor, with TLS per process to isolate the instances.

What do you mean by use TLS? From how I interpret "use TLS", it means for me connect to the target website via "https" instead of "http", is this correct or am I misunderstanding what you mean?

Maybe I should clarify: the application will run on a single dedicated server (32 cores, 256 GB RAM), and we are not an organization but freelancers who create automated testing solutions for the websites we build.

@kumaraditya303
Contributor

kumaraditya303 commented May 12, 2021

By TLS I meant thread-local storage, which is like Python's contextvars module but for threads rather than asyncio. It is the recommended way to store connection objects and the like in multithreaded environments, and it can also be used for Playwright.
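
To make the term concrete, here is a small standard-library sketch of what thread-local storage does. No Playwright is involved, and the names `local`, `worker`, and `seen` are purely illustrative. Each thread that writes `local.value` gets its own slot, and the main thread, which never writes it, sees nothing:

```python
import threading

local = threading.local()        # one object, but a separate namespace per thread
barrier = threading.Barrier(3)   # make all three threads write before any reads
seen = {}


def worker(name):
    local.value = name           # stored in this thread's private namespace
    barrier.wait()               # by now every thread has written its own value
    seen[name] = local.value     # still reads back this thread's own value


threads = [threading.Thread(target=worker, args=(f"thread-{i}",)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

main_thread_has_value = hasattr(local, "value")
print(seen)
print("main thread sees a value:", main_thread_has_value)
```

Storing a Playwright instance on such an object means every thread only ever reaches the instance it created itself.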

@sla-te

sla-te commented May 12, 2021

By TLS I meant thread-local storage, which is like Python's contextvars module but for threads rather than asyncio. It is the recommended way to store connection objects and the like in multithreaded environments, and it can also be used for Playwright.

I have to admit that I haven't worked with contextvars before. Would it be possible to create a small snippet for us that shows how you would suggest going forward, keeping in mind the ThreadPoolExecutor snippet I posted above?

EDIT: I did some digging regarding TLS and found https://stackoverflow.com/questions/1408171/thread-local-storage-in-python - do you mean, that we should instantiate the playwright instances inside the threads in such a thread-local namespace?

@sla-te

sla-te commented May 12, 2021

Okay, so this is what we have come up with, hope, we have understood your suggestions correctly:

import random
import threading
from concurrent.futures.thread import ThreadPoolExecutor
from time import sleep

from loguru import logger
from typing import Optional

from playwright.sync_api import Playwright, Browser, BrowserContext, Page
from playwright.sync_api import sync_playwright


class Tls(threading.local):
    def __init__(self):
        # Per-thread slots, filled in by Generator.run() on the owning thread
        self.playwright: Optional[Playwright] = None
        self.browser: Optional[Browser] = None
        self.context: Optional[BrowserContext] = None
        self.page: Optional[Page] = None


class Generator:
    tls = Tls()

    def __init__(self):
        pass

    def run(self, k):
        logger.info("THREAD: %s - ENTER" % k)

        self.tls.playwright = sync_playwright().start()
        self.tls.browser = self.tls.playwright.firefox.launch(headless=True)
        self.tls.context = self.tls.browser.new_context(
            bypass_csp=True,
            ignore_https_errors=True,
            color_scheme=random.choice(["dark", "light", "no-preference"]),
            timezone_id=None,
            geolocation={"longitude": 1, "latitude": 2},
            locale="en-US",
            java_script_enabled=True,
            user_agent=None,
        )
        self.tls.page = self.tls.context.new_page()
        self.tls.page.goto("https://google.com")
        self.tls.page.screenshot(path=f'{random.randint(100, 10000)}.png')

        self.tls.page.close()
        self.tls.context.close()
        self.tls.browser.close()
        self.tls.playwright.stop()

        logger.info("THREAD: %s - EXIT" % k)


if __name__ == "__main__":
    # The with-block waits for every submitted task, so there is no need to
    # poll the executor's private _threads attribute.
    with ThreadPoolExecutor() as tpe:
        futures = []
        for i in range(1, 11):
            generator = Generator()
            futures.append(tpe.submit(generator.run, i))
            sleep(0.1)  # stagger browser start-up
    for future in futures:
        future.result()  # re-raise any exception from a worker thread

Going on from this, I have three more questions regarding stability:

  1. Is it ok, to create a new, individual context for each new page we define on top of one playwright->browser instance?
  2. How many contexts/pages would you suggest to use at max per browser?
  3. Which of the following (a,b) would you suggest to prefer?
    a. One Playwright, one context, one page per thread
    b. One Playwright, n contexts, n pages per thread [n to be advised from your side, (see 2.) ]

I'm not sure how Playwright works internally, but I could imagine that beyond a certain number of pages/contexts inside one browser, each of which ultimately results in a new browser tab, it can become unstable.

@kumaraditya303
Contributor

kumaraditya303 commented May 13, 2021

I have to admit that I haven't worked with contextvars before. Would it be possible to create a small snippet for us that shows how you would suggest going forward, keeping in mind the ThreadPoolExecutor snippet I posted above?

@chwba

Here is a snippet which uses TLS i.e. Thread Local Storage and other best practises for multithreaded safe playwright script:

import threading
from playwright.sync_api import sync_playwright
from concurrent.futures import ThreadPoolExecutor


class Tls(threading.local):
    def __init__(self) -> None:
        self.playwright = sync_playwright().start()
        print("Create playwright instance in Thread", threading.current_thread().name)


class Worker:
    tls = Tls()

    def run(self):
        print("Launched worker in ", threading.current_thread().name)
        browser = self.tls.playwright.chromium.launch(headless=False)
        context = browser.new_context()
        page = context.new_page()  # create the page in the context opened above
        page.goto("http://whatsmyuseragent.org/")
        page.screenshot(path=f"example-{threading.current_thread().name}.png")
        page.close()
        context.close()
        browser.close()
        print("Stopped worker in ", threading.current_thread().name)


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=5) as executor:
        for _ in range(50):
            worker = Worker()
            executor.submit(worker.run)

Here each thread creates its own Playwright object the first time it runs a task and reuses it for later tasks on the same thread. Because the instance is stored in TLS, it is never shared across threads, so it is thread safe and shouldn't cause errors or race conditions. You can change max_workers to suit your needs.

Here is the gist url https://gist.github.com/kumaraditya303/e6dee949dda298b35d167369955d45c6
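
One detail of this pattern is easy to miss: a `threading.local` subclass runs its `__init__` once in every thread that touches the object, including the thread that creates it. The sketch below replaces `sync_playwright().start()` with a plain bookkeeping list (an assumption, so that it runs without a browser) and shows that 50 tasks on 5 workers trigger only a handful of initialisations, one of them in the creating thread because `tls = Tls()` is evaluated when the class body runs:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

init_threads = []                 # names of the threads in which __init__ ran
init_lock = threading.Lock()


class Tls(threading.local):
    def __init__(self):
        # Stand-in for: self.playwright = sync_playwright().start()
        with init_lock:
            init_threads.append(threading.current_thread().name)
        self.owner = threading.current_thread().name


class Worker:
    tls = Tls()                   # __init__ runs here, in the defining thread

    def run(self):
        # The first access from a new thread re-runs Tls.__init__ for that thread
        return self.tls.owner


creator = threading.current_thread().name
with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(Worker().run) for _ in range(50)]
    owners = {future.result() for future in futures}

print(f"{len(init_threads)} initialisations for 50 tasks")
print("initialised in the creating thread too:", creator in init_threads)
```

With the real `Tls` class this means one Playwright driver is also started in the main thread, even if no browser work happens there.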

@kumaraditya303
Contributor

kumaraditya303 commented May 13, 2021

Is it ok, to create a new, individual context for each new page we define on top of one playwright->browser instance?

Yes, you should create a new context per thread

How many contexts/pages would you suggest to use at max per browser?

When you launch one Playwright instance, it creates two subprocesses: the driver and the browser you want to use. Given that, I would say you can create multiple contexts, around 3 per thread.

Which of the following (a,b) would you suggest to prefer?
a. One Playwright, one context, one page per thread
b. One Playwright, n contexts, n pages per thread [n to be advised from your side, (see 2.) ]

Do not create multiple Playwright instances per thread; always use thread-local storage to isolate them from each other.
As stated above, 2~3 contexts per browser should be a good amount.

@kumaraditya303
Contributor

kumaraditya303 commented May 13, 2021

Maybe I should clarify: the application will run on a single dedicated server (32 cores, 256 GB RAM), and we are not an organization but freelancers who create automated testing solutions for the websites we build.

I missed that. If you have 32 cores, then you should use at least 3~4 contexts per thread and 64 threads. That would be a good start.
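
Several contexts per thread, processed one after another, could look roughly like this. It is a sketch under assumptions: Firefox, caller-supplied URLs, and the hypothetical helper names `process_sequentially` and `thread_job`. Each call to `thread_job` is meant to run in its own thread with its own Playwright instance:

```python
def process_sequentially(browser, urls):
    """Visit each URL in its own fresh context, one after another."""
    titles = []
    for url in urls:
        context = browser.new_context()   # isolated cookies and storage per job
        try:
            page = context.new_page()
            page.goto(url)
            titles.append(page.title())
        finally:
            context.close()               # closing a context also closes its pages
    return titles


def thread_job(urls):
    """Entry point for one worker thread: own Playwright, own browser."""
    # Imported lazily so process_sequentially() can be tried without a browser.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as playwright:
        browser = playwright.firefox.launch(headless=True)
        try:
            return process_sequentially(browser, urls)
        finally:
            browser.close()
```

A thread pool would then receive one list of URLs per task, for example `executor.submit(thread_job, ["https://example.com", "https://example.org"])`. Within a thread the contexts run strictly in sequence, so this raises the number of jobs per browser rather than the concurrency.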

@sla-te

sla-te commented May 13, 2021

Thank you for the thorough replies.

One last question: if we go ahead and use 3-4 contexts per thread, with each context working on different tests, the only approach I have come up with is opening another 3-4 threads from inside that thread (which wouldn't work, because the new threads would no longer have access to the thread-local Playwright). See the code snippet below to clarify:

import random
import threading
from concurrent.futures.thread import ThreadPoolExecutor
from time import sleep

from loguru import logger
from typing import Optional

from playwright.sync_api import Playwright, Browser, BrowserContext, Page
from playwright.sync_api import sync_playwright


class Tls(threading.local):
    def __init__(self):
        # Per-thread slots, filled in by Generator.run() on the owning thread
        self.playwright: Optional[Playwright] = None
        self.browser: Optional[Browser] = None
        self.context: Optional[BrowserContext] = None
        self.page: Optional[Page] = None


class Generator:
    tls = Tls()

    def __init__(self):
        pass

    def run(self, k):
        logger.info("THREAD: %s - ENTER" % k)

        self.tls.playwright = sync_playwright().start()
        self.tls.browser = self.tls.playwright.firefox.launch(headless=True)

        # Create 3 different contexts
        self.tls.context = self.tls.browser.new_context()
        self.tls.second_context = self.tls.browser.new_context()
        self.tls.third_context = self.tls.browser.new_context()

        # Create one page in each context
        self.tls.page = self.tls.context.new_page()
        self.tls.second_page = self.tls.second_context.new_page()
        self.tls.third_page = self.tls.third_context.new_page()

        # Navigate to 3 different websites
        self.tls.page.goto("https://google.com")
        self.tls.second_page.goto("https://web.de")
        self.tls.third_page.goto("https://cnn.com")
        # do separate work on each of the pages
        # how can we achieve concurrency in this scenario?
    
        self.tls.page.close()
        self.tls.second_page.close()
        self.tls.third_page.close()

        self.tls.context.close()
        self.tls.second_context.close()
        self.tls.third_context.close()

        self.tls.browser.close()
        self.tls.playwright.stop()

        logger.info("THREAD: %s - EXIT" % k)


if __name__ == "__main__":
    # The with-block waits for every submitted task, so there is no need to
    # poll the executor's private _threads attribute.
    with ThreadPoolExecutor() as tpe:
        futures = []
        for i in range(1, 11):
            generator = Generator()
            futures.append(tpe.submit(generator.run, i))
            sleep(0.1)  # stagger browser start-up
    for future in futures:
        future.result()  # re-raise any exception from a worker thread

@kumaraditya303
Contributor

kumaraditya303 commented May 13, 2021

@chwba First, I would recommend initialising Playwright in the thread-local constructor so that it can be reused, as in the example I gave earlier. By 3~4 contexts I meant processing them synchronously within one thread. Since you need concurrency there too, you should instead create one context per thread and parallelize with threads. Start at 64 threads and increase the count for as long as your server handles it correctly; 64 threads means 64 browser subprocesses, so you may run out of memory.
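
When ramping up the thread count as suggested, it helps to see which jobs fail at each setting, because `executor.submit` on its own stores an exception in the returned future and nothing is reported unless the result is read. A standard-library sketch (the helper `run_batch` and the dummy `flaky_job` are illustrative; a real job would start its own Playwright instance inside the thread):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_batch(task, items, max_workers):
    """Run task(item) for every item; return (results, failures) keyed by item."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(task, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                results[item] = future.result()
            except Exception as exc:   # keep going, but remember what broke
                failures[item] = exc
    return results, failures


def flaky_job(n):
    # Dummy job standing in for one browser session
    if n % 4 == 0:
        raise RuntimeError(f"job {n} failed")
    return n * n


for workers in (2, 4, 8):
    ok, failed = run_batch(flaky_job, range(1, 11), max_workers=workers)
    print(f"{workers} workers: {len(ok)} ok, {len(failed)} failed")
```

Comparing the failure count across worker settings gives a rough signal for where the machine stops coping.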

@sla-te

sla-te commented May 17, 2021

@kumaraditya303 Thank you again for the thorough replies. I wanted to let you know that we have now finished the solution. We went for 1 playwright/context/page per thread and were able to reach 45-50 simultaneous threads running headless Firefox browsers with a CPU load of ~85% on average and peaks at 100%. RAM is no issue at all: only 30% is used at full load. If we increase to 55 or more threads, the CPU load stays at 100%, and it does not feel safe to do that over a long period. Do you have any hints on how we could tweak Playwright a little further to squeeze out another couple of threads?

PS: Regarding the thread-local constructor, we left the attributes as None because we thought it would otherwise also start a Playwright process in the main thread, and we only run Playwright inside the worker threads; outside of them we only print statistics and manage everything else.
PS 2: We are using an AMD EPYC 7502P with 128 GB DDR4 ECC RAM

@kumaraditya303
Contributor

kumaraditya303 commented May 17, 2021

@chwba The CPU load is reasonable, as there would be around 50 browser processes running simultaneously; I don't think you can run more threads at this point. However, if you want more performance at the same CPU load, you can combine threading and asyncio. You would then use multiple contexts per thread, so it would use more RAM.
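
The thread does not show what combining threading and asyncio looks like. One possible reading, offered as an untested sketch rather than a recommended recipe: each thread runs its own event loop via `asyncio.run` and launches its own browser inside that loop, so no Playwright object crosses a thread boundary. It assumes Python 3.8 or newer and Playwright's async API; the function names and the grouping of URLs are made up for illustration:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def visit_in_new_context(browser, url):
    """Load one URL in its own context and return the page title."""
    context = await browser.new_context()
    try:
        page = await context.new_page()
        await page.goto(url)
        return await page.title()
    finally:
        await context.close()


async def thread_main(urls):
    # Imported lazily so the helper above can be tried without a browser.
    from playwright.async_api import async_playwright

    async with async_playwright() as playwright:
        browser = await playwright.firefox.launch(headless=True)
        try:
            # Several contexts of the same browser run concurrently on this loop
            jobs = (visit_in_new_context(browser, url) for url in urls)
            return await asyncio.gather(*jobs)
        finally:
            await browser.close()


def run_in_thread(urls):
    # asyncio.run() creates a fresh event loop owned by the calling thread
    return asyncio.run(thread_main(urls))


def run_all(url_groups, max_threads):
    """One thread, one event loop, and one browser per group of URLs."""
    with ThreadPoolExecutor(max_workers=max_threads) as executor:
        return list(executor.map(run_in_thread, url_groups))
```

A call such as `run_all([["https://example.com"], ["https://example.org"]], max_threads=2)` would start two browsers, each driven from its own thread. The browser is created inside the thread that uses it and is never handed from one thread to another.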

@simplPart

simplPart commented Jun 12, 2021

combine threading and asyncio

Can you show how to do it correctly with asyncio? And does Firefox have any arguments like https://stackoverflow.com/a/58589026 to reduce the load on the processors? (I have many contexts running: 1 browser per process and many contexts per thread.)

First I start the process -> start the asyncio loop in the process -> start the browser -> create about 5 threads and pass the browser to them -> start about 5 contexts per thread. This loads the server heavily (6 cores, 32 GB RAM): the cores are at 100%, and then the browser simply closes with "Target page, context or browser has been closed".

@Konano

Konano commented Oct 17, 2021

combine threading and asyncio

Can you show how to do it correctly with asyncio? And does Firefox have any arguments like https://stackoverflow.com/a/58589026 to reduce the load on the processors? (I have many contexts running: 1 browser per process and many contexts per thread.)

First I start the process -> start the asyncio loop in the process -> start the browser -> create about 5 threads and pass the browser to them -> start about 5 contexts per thread. This loads the server heavily (6 cores, 32 GB RAM): the cores are at 100%, and then the browser simply closes with "Target page, context or browser has been closed".

Same problem

@kumaraditya303
Contributor

@Konano Please create a new issue for asyncio.

@Konano

Konano commented Oct 17, 2021

@kumaraditya303 you mean this problem is related to asyncio, not to playwright?

@kumaraditya303
Contributor

@kumaraditya303 you mean this problem is related to asyncio, not to playwright?

No, please create a new issue in the playwright-python repo to keep that discussion separate from this one.
