Calling Playwright remotely: a simple implementation

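Microsoft's Playwright is a very good automation tool that can drive Chromium, Firefox, and WebKit through a single API. Reading this, you may already be thinking of using it for web scraping: compared with puppeteer or pyppeteer, it supports more browsers and more languages. However, at the time of writing it does not officially provide a way to call it remotely. This post walks through a simple approach to calling Playwright remotely, with some simple code. It uses playwright-python as the example; the same approach applies to Playwright in other languages.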

Installation and startup

When playwright-python starts a browser process, it does so through the Node.js version of Playwright.


During installation, setup.py downloads the corresponding version of the Node.js driver.


The startup code is:

https://github.com/microsoft/playwright-python/blob/03e5cd01fdda2125cea47ab443d34564f767af13/playwright/_impl/_transport.py#L57

class Transport:
    async def run(self) -> None:
        self._loop = asyncio.get_running_loop()
        self._stopped_future: asyncio.Future = asyncio.Future()

        # self._driver_executable is
        # /Users/lozzo/.virtualenvs/py37/lib/python3.7/site-packages/playwright/driver/playwright.sh
        self._proc = proc = await asyncio.create_subprocess_exec(
            str(self._driver_executable),
            "run-driver",
            stdin=asyncio.subprocess.PIPE,
            stdout=asyncio.subprocess.PIPE,
            stderr=_get_stderr_fileno(),
            limit=32768,
        )
        assert proc.stdout
        assert proc.stdin
        self._output = proc.stdin

        while not self._stopped:
            try:
                # Each message is a 4-byte little-endian length followed by a JSON payload.
                buffer = await proc.stdout.readexactly(4)
                length = int.from_bytes(buffer, byteorder="little", signed=False)
                buffer = bytes(0)
                while length:
                    to_read = min(length, 32768)
                    data = await proc.stdout.readexactly(to_read)
                    length -= to_read
                    if len(buffer):
                        buffer = buffer + data
                    else:
                        buffer = data
                obj = json.loads(buffer)

                if "DEBUGP" in os.environ:  # pragma: no cover
                    print("\x1b[33mRECV>\x1b[0m", json.dumps(obj, indent=2))
                self.on_message(obj)
            except asyncio.IncompleteReadError:
                break
            await asyncio.sleep(0)
        self._stopped_future.set_result(None)

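So to make Playwright remotely accessible, we only need to change how the Python package starts the driver. The Node.js driver running on the remote machine then handles the forwarded requests, and once it is exposed over a network endpoint (a WebSocket, or a plain TCP socket as in the code in this post) it can be accessed remotely.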
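The read loop in Transport.run shows the wire format that any remote link has to carry unchanged: a 4-byte little-endian length followed by a JSON payload. Below is a minimal sketch of that framing as two standalone helpers. The names encode_frame and decode_frames are hypothetical, the message fields are purely illustrative, and the encoder assumes the send path mirrors the read path shown above.

```python
import json
import struct
from typing import Dict, List, Tuple


def encode_frame(message: Dict) -> bytes:
    """Serialize a message as a 4-byte little-endian length plus UTF-8 JSON.

    Assumption: the driver expects the same framing it uses when replying.
    """
    payload = json.dumps(message).encode("utf-8")
    return struct.pack("<I", len(payload)) + payload


def decode_frames(data: bytes) -> Tuple[List[Dict], bytes]:
    """Parse every complete frame in data; return the messages and leftover bytes."""
    messages = []
    offset = 0
    while len(data) - offset >= 4:
        (length,) = struct.unpack_from("<I", data, offset)
        start = offset + 4
        if len(data) - start < length:
            break  # the last frame is still incomplete
        messages.append(json.loads(data[start:start + length]))
        offset = start + length
    return messages, data[offset:]
```

Because every frame carries its own length, a relay in the middle never needs to parse anything: it can forward raw bytes in arbitrary chunks and the receiver can still find the message boundaries.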


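The Python side and the Node.js side talk to each other through inter-process communication (the driver's stdin and stdout pipes). So all we need to do is wrap each side in a thin layer and have those two layers communicate over a socket; that is enough to make remote calls work.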


Implementation

Changes on the Python side

Modify the code at https://github.com/microsoft/playwright-python/blob/29cddbd5174ab262e5cb57b2d8c8fbcf8df3e171/playwright/_impl/_driver.py#L24

def compute_driver_executable() -> Path:
    # Hard-coded path to the replacement launcher script (machine-specific).
    return Path("/Users/lozzo/.virtualenvs/py37/lib/python3.7/site-packages/playwright/driver/playwright.sh")
    # Original implementation, kept for reference:
    # package_path = Path(inspect.getfile(playwright)).parent
    # platform = sys.platform
    # if platform == "win32":
    #     return package_path / "driver" / "playwright.cmd"
    # return package_path / "driver" / "playwright.sh"

where /Users/lozzo/.virtualenvs/py37/lib/python3.7/site-packages/playwright/driver/playwright.sh contains:

#!/bin/sh
#SCRIPT_PATH="$(cd "$(dirname "$0")" ; pwd -P)"
#$SCRIPT_PATH/node $SCRIPT_PATH/package/lib/cli/cli.js "$@"
cd /Users/lozzo/workdir/sovietironfist/test
ts-node processPipe.ts
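Hard-coding an absolute path inside compute_driver_executable ties the patch to one machine. One possible alternative, sketched here under assumptions, is to read an override from the environment and fall back to the stock path otherwise. The variable name PLAYWRIGHT_DRIVER_OVERRIDE and the helper resolve_driver are hypothetical; Playwright itself does not read this variable.

```python
import os
from pathlib import Path

# Hypothetical variable name chosen for this sketch; Playwright does not read it.
ENV_VAR = "PLAYWRIGHT_DRIVER_OVERRIDE"


def resolve_driver(default: Path) -> Path:
    """Return the override path from the environment if set, else the default."""
    override = os.environ.get(ENV_VAR)
    return Path(override) if override else default
```

With something like this, compute_driver_executable could keep its original body and pass its result through resolve_driver, so the remote shim is only used when the variable is set.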

processPipe.ts contains:

import net from "net"
;(async () => {
  // Bridge the local pipes to the remote driver: stdin -> socket, socket -> stdout.
  const socket = new net.Socket()
  socket.connect({ host: "127.0.0.1", port: 12345 })
  process.stdin.pipe(socket)
  socket.pipe(process.stdout)
})()

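Note: do not use console.* for any standard input/output in this script. The local playwright-python process reads whatever this script writes to its standard output, so extra output ends up in the message stream and causes exceptions.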
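If you would rather not depend on ts-node on the client machine, the same shim can be sketched in Python. This is only an illustration of the idea, not the author's code: the function names pump and main are hypothetical, and the host and port simply mirror the values used in processPipe.ts. As with the TypeScript version, nothing except the forwarded bytes may be written to stdout.

```python
import socket
import sys
import threading


def pump(read, write, chunk_size=32768):
    """Forward bytes from read(chunk_size) to write(chunk) until EOF (b"")."""
    while True:
        chunk = read(chunk_size)
        if not chunk:
            return
        write(chunk)


def main(host="127.0.0.1", port=12345):
    """Bridge this process's stdin/stdout to a remote TCP socket."""
    sock = socket.create_connection((host, port))
    stdin, stdout = sys.stdin.buffer, sys.stdout.buffer

    def stdin_to_socket():
        # read1 returns as soon as some bytes are available instead of
        # waiting for a full chunk, so small messages are not delayed.
        pump(stdin.read1, sock.sendall)
        sock.shutdown(socket.SHUT_WR)  # tell the remote side that stdin closed

    def write_stdout(chunk):
        stdout.write(chunk)
        stdout.flush()

    threading.Thread(target=stdin_to_socket, daemon=True).start()
    pump(sock.recv, write_stdout)  # socket -> stdout on the main thread
```

Saved as a script that ends with a call to main(), it could be launched from playwright.sh in place of ts-node processPipe.ts.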

Alternatively, you can modify the Transport class in playwright/_impl/_transport.py so that it connects directly to the remote socket, which removes one layer:

class Transport:
    async def run(self) -> None:
        self._loop = asyncio.get_running_loop()
        self._stopped_future: asyncio.Future = asyncio.Future()

        # Connect straight to the remote driver instead of spawning a local subprocess.
        reader, writer = await asyncio.open_connection(host='127.0.0.1', port=12345)
        self._output = writer

        while not self._stopped:
            try:
                buffer = await reader.readexactly(4)
                length = int.from_bytes(buffer, byteorder="little", signed=False)
                buffer = bytes(0)
                while length:
                    to_read = min(length, 32768)
                    data = await reader.readexactly(to_read)
                    length -= to_read
                    if len(buffer):
                        buffer = buffer + data
                    else:
                        buffer = data
                obj = json.loads(buffer)

                if "DEBUGP" in os.environ:  # pragma: no cover
                    print("\x1b[33mRECV>\x1b[0m", json.dumps(obj, indent=2))
                self.on_message(obj)
            except asyncio.IncompleteReadError:
                break
            await asyncio.sleep(0)
        self._stopped_future.set_result(None)
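The direct-socket variant works because an asyncio socket stream offers the same readexactly call as the subprocess pipe. The sketch below isolates that idea so it can be tried without Playwright: read_frame and write_frame are hypothetical helper names, write_frame assumes the send side uses the same length-prefixed framing as the read loop, and the echo server merely stands in for the remote driver.

```python
import asyncio
import json


async def read_frame(reader: asyncio.StreamReader) -> dict:
    """Read one length-prefixed JSON message from the stream."""
    header = await reader.readexactly(4)
    length = int.from_bytes(header, byteorder="little", signed=False)
    payload = await reader.readexactly(length)
    return json.loads(payload)


def write_frame(writer: asyncio.StreamWriter, message: dict) -> None:
    """Write one message with a 4-byte little-endian length prefix (assumed format)."""
    data = json.dumps(message).encode()
    writer.write(len(data).to_bytes(4, byteorder="little", signed=False) + data)


async def demo() -> dict:
    """Round-trip one frame through a local echo server standing in for the driver."""

    async def echo(reader, writer):
        write_frame(writer, await read_frame(reader))
        await writer.drain()
        writer.close()

    server = await asyncio.start_server(echo, "127.0.0.1", 0)  # port 0: pick a free port
    port = server.sockets[0].getsockname()[1]
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    write_frame(writer, {"id": 1, "method": "ping"})
    await writer.drain()
    reply = await read_frame(reader)
    writer.close()
    await writer.wait_closed()
    server.close()
    return reply
```

Running asyncio.run(demo()) returns the message that was sent, which is a quick way to check the framing over a real socket before pointing the patched Transport at a remote machine.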

Changes on the Node.js side

The long-running process is socketPipe.ts:

import net from 'net'
import { spawn, ChildProcessWithoutNullStreams } from 'child_process'
;(async () => {
  let child: ChildProcessWithoutNullStreams | undefined
  const server = net.createServer()
  server.listen(12345)
  const close = () => {
    if (child) {
      // kill() with no argument sends SIGTERM; signal 0 would only test that the process exists
      child.kill()
      child = undefined
    }
  }
  server.on('connection', (socket: net.Socket) => {
    console.log('connection')
    if (!child) {
      const spawned = spawn('/Users/lozzo/.virtualenvs/py37/lib/python3.7/site-packages/playwright/driver/b.sh')
      // register the exit handler once per child, and forget the child once it has exited
      spawned.on('exit', (number, signal) => {
        console.log('exit', number, signal)
        if (child === spawned) {
          child = undefined
        }
      })
      child = spawned
    }
    // bridge the driver's pipes to the client socket
    child.stdout.pipe(socket)
    socket.pipe(child.stdin)
    socket.on('error', close)
  })

  server.on('close', close)
  server.on('error', close)
})()

where /Users/lozzo/.virtualenvs/py37/lib/python3.7/site-packages/playwright/driver/b.sh contains:

#!/bin/sh
SCRIPT_PATH="$(cd "$(dirname "$0")" ; pwd -P)"
$SCRIPT_PATH/node $SCRIPT_PATH/package/lib/cli/cli.js 'run-driver'

Start the server with ts-node socketPipe.ts.

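After that, Playwright can be called remotely, and local code uses it without noticing any difference. The same approach works for the Playwright bindings in other languages.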
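For readers who want to keep the server side in Python as well, the relay in socketPipe.ts can be approximated with asyncio. This is a sketch under assumptions rather than a port of the author's code: the names pump and serve are hypothetical, and unlike socketPipe.ts it starts one child process per connection instead of sharing a single one. The command argument would be the path to b.sh on the server.

```python
import asyncio
import sys


async def pump(reader, writer):
    """Forward bytes from reader to writer until EOF, then close the writer."""
    try:
        while True:
            chunk = await reader.read(32768)
            if not chunk:
                break
            writer.write(chunk)
            await writer.drain()
    finally:
        writer.close()


async def serve(command, host="127.0.0.1", port=12345):
    """Start a TCP server that bridges each connection to a fresh child process."""

    async def handle(sock_reader, sock_writer):
        proc = await asyncio.create_subprocess_exec(
            *command,
            stdin=asyncio.subprocess.PIPE,
            stdout=asyncio.subprocess.PIPE,
        )
        try:
            await asyncio.gather(
                pump(sock_reader, proc.stdin),   # client -> child stdin
                pump(proc.stdout, sock_writer),  # child stdout -> client
            )
        finally:
            if proc.returncode is None:
                try:
                    proc.kill()
                except ProcessLookupError:
                    pass  # the child already exited
            await proc.wait()

    return await asyncio.start_server(handle, host, port)
```

A long-running entry point would await serve([...]) and then call serve_forever() on the returned server. Spawning per connection keeps clients isolated from each other, at the cost of one driver process per client.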


Unless otherwise stated, all articles on this blog are licensed under CC BY-SA 4.0. Please credit the source when reposting!