
How To Run Selenium-scrapy in parallel

This article covers how to run a Selenium-based scraper in parallel; it should be a useful reference for anyone facing the same problem.

Problem description

I'm trying to scrape a JavaScript website with Scrapy and Selenium. I open the JavaScript site with Selenium and the Chrome driver, use Scrapy to scrape all the links to the different listings from the current page, and store them in a list (so far, trying to follow the links with SeleniumRequest and calling back into a parse-new-page function has produced lots of errors). I then loop over the list of URLs, open each one in the Selenium driver, and scrape the information from the page. So far this runs at about 16 pages per minute, which is not ideal given the number of listings on this site. Ideally I would have the Selenium drivers open links in parallel, as in:

                  How can I make Selenium run in parallel with Scrapy?

                  https://gist.github.com/miraculixx/2f9549b79b451b522dde292c4a44177b

However, I don't know how to implement parallel processing in the Selenium-Scrapy code.

    import scrapy
    import time
    from scrapy.selector import Selector
    from scrapy_selenium import SeleniumRequest
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support.ui import Select
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC


    class MarketPagSpider(scrapy.Spider):
        name = 'marketPagination'

        responses = []

        def start_requests(self):
            yield SeleniumRequest(
                url="https://www.cryptoslam.io/nba-top-shot/marketplace",
                wait_time=5,
                wait_until=EC.presence_of_element_located((By.XPATH, '//SELECT[@name="table_length"]')),
                callback=self.parse
            )

        def parse(self, response):
            # initialize driver
            driver = response.meta['driver']
            driver.set_window_size(1920, 1080)

            time.sleep(1)
            WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.XPATH, "(//th[@class='nowrap sorting'])[1]"))
            )

            # collect the absolute URLs of every listing on the current page
            rows = response.xpath("//tbody/tr[@role='row']")
            for row in rows:
                link = row.xpath(".//td[4]/a/@href").get()
                absolute_url = response.urljoin(link)

                self.responses.append(absolute_url)

            # visit each listing URL sequentially with the same driver
            for resp in self.responses:
                driver.get(resp)
                html = driver.page_source
                response_obj = Selector(text=html)

                yield {
                    'name': response_obj.xpath("//div[@class='ibox-content animated fadeIn fetchable-content js-attributes-wrapper']/h4[4]/span/a/text()").get(),
                    'price': response_obj.xpath("//span[@class='js-auction-current-price']/text()").get()
                }

I know scrapy-splash can handle multiprocessing, but the website I'm trying to scrape doesn't open in Splash (at least I don't think it does).

I've also removed the pagination lines to keep the code concise.

I'm very new to this and open to any suggestions or solutions for multiprocessing with Selenium.

Recommended answer

The following sample program creates a thread pool with just 2 threads for demonstration purposes, then scrapes 4 URLs to get their titles:

                  from multiprocessing.pool import ThreadPool
                  from bs4 import BeautifulSoup
                  from selenium import webdriver
                  import threading
                  import gc
                  
                  class Driver:
                      def __init__(self):
                          options = webdriver.ChromeOptions()
                          options.add_argument("--headless")
                          # suppress logging:
                          options.add_experimental_option('excludeSwitches', ['enable-logging'])
                          self.driver = webdriver.Chrome(options=options)
                          print('The driver was just created.')
                  
                      def __del__(self):
                          self.driver.quit() # clean up driver when we are cleaned up
                          print('The driver has terminated.')
                  
                  
                  threadLocal = threading.local()
                  
                  def create_driver():
                      the_driver = getattr(threadLocal, 'the_driver', None)
                      if the_driver is None:
                          the_driver = Driver()
                          setattr(threadLocal, 'the_driver', the_driver)
                      return the_driver.driver
                  
                  
                  def get_title(url):
                      driver = create_driver()
                      driver.get(url)
                      source = BeautifulSoup(driver.page_source, "lxml")
                      title = source.select_one("title").text
                      print(f"{url}: '{title}'")
                  
                  # just 2 threads in our pool for demo purposes:
                  with ThreadPool(2) as pool:
                      urls = [
                          'https://www.google.com',
                          'https://www.microsoft.com',
                          'https://www.ibm.com',
                          'https://www.yahoo.com'
                      ]
                      pool.map(get_title, urls)
                      # must be done before terminate is explicitly or implicitly called on the pool:
                      del threadLocal
                      gc.collect()
                  # pool.terminate() is called at exit of with block
                  

Prints:

                  The driver was just created.
                  The driver was just created.
                  https://www.google.com: 'Google'
                  https://www.microsoft.com: 'Microsoft - Official Home Page'
                  https://www.ibm.com: 'IBM - United States'
                  https://www.yahoo.com: 'Yahoo'
                  The driver has terminated.
                  The driver has terminated.
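
To fold this pattern back into the spider from the question, one option is to collect the listing URLs in parse and then hand them to a thread pool, with each worker thread lazily creating and reusing its own headless driver. The following is a minimal sketch under those assumptions: it reuses the XPath expressions from the question, picks an arbitrary pool size of 4, and omits the waits, pagination and error handling of the original spider.

    import gc
    import threading
    from multiprocessing.pool import ThreadPool

    import scrapy
    from scrapy.selector import Selector
    from scrapy_selenium import SeleniumRequest
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC

    threadLocal = threading.local()


    class Driver:
        def __init__(self):
            options = webdriver.ChromeOptions()
            options.add_argument("--headless")
            self.driver = webdriver.Chrome(options=options)

        def __del__(self):
            # quit the browser when this wrapper is garbage-collected
            self.driver.quit()


    def create_driver():
        # one driver per worker thread, created lazily and reused for every URL
        the_driver = getattr(threadLocal, 'the_driver', None)
        if the_driver is None:
            the_driver = Driver()
            setattr(threadLocal, 'the_driver', the_driver)
        return the_driver.driver


    def scrape_listing(url):
        # load a single listing page and extract the fields with the XPaths from the question
        driver = create_driver()
        driver.get(url)
        sel = Selector(text=driver.page_source)
        return {
            'name': sel.xpath("//div[@class='ibox-content animated fadeIn fetchable-content js-attributes-wrapper']/h4[4]/span/a/text()").get(),
            'price': sel.xpath("//span[@class='js-auction-current-price']/text()").get(),
        }


    class MarketPagSpider(scrapy.Spider):
        name = 'marketPagination'

        def start_requests(self):
            yield SeleniumRequest(
                url="https://www.cryptoslam.io/nba-top-shot/marketplace",
                wait_time=5,
                wait_until=EC.presence_of_element_located((By.XPATH, '//SELECT[@name="table_length"]')),
                callback=self.parse,
            )

        def parse(self, response):
            global threadLocal
            # collect the absolute listing URLs from the rendered table first
            urls = [
                response.urljoin(row.xpath(".//td[4]/a/@href").get())
                for row in response.xpath("//tbody/tr[@role='row']")
            ]
            # 4 worker threads is an arbitrary choice for illustration
            with ThreadPool(4) as pool:
                for item in pool.map(scrape_listing, urls):
                    yield item
                # release the per-thread Driver wrappers (and their browsers)
                # before the pool terminates, mirroring the cleanup in the answer
                threadLocal = threading.local()
                gc.collect()

As in the answer above, the thread-local holder is released (and a garbage collection forced) inside the with block, so that each Driver's __del__ can quit its browser while the worker threads still exist.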
                  

That concludes this article on running a Selenium scraper in parallel; hopefully the recommended answer is helpful.

