Problem description
I am looking for a simple way to save a CSV file originating from a published Google Sheets document. Since it's published, it's accessible through a direct link (modified on purpose in the example below).
All my browsers prompt me to save the CSV file as soon as I open the link.
Neither:
import urllib.request

DOC_URL = 'https://docs.google.com/spreadsheet/ccc?key=0AoOWveO-dNo5dFNrWThhYmdYW9UT1lQQkE&output=csv'
f = urllib.request.urlopen(DOC_URL)
cont = f.read(SIZE)  # SIZE is not defined in the snippet as posted
f.close()
cont = str(cont, 'utf-8')
print(cont)
nor:
req = urllib.request.Request(DOC_URL)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.13 (KHTML, like Gecko) Chrome/24.0.1284.0 Safari/537.13')
f = urllib.request.urlopen(req)
print(f.read().decode('utf-8'))
prints anything but HTML content.
(I tried the second version after reading this other post: Download google docs public spreadsheet to csv with python.)
Any idea what I am doing wrong? I am logged out of my Google account, for what it's worth, but the link works from every browser I tried. As far as I understand, the Google Docs API has not yet been ported to Python 3, and given the "toy" scale of my little personal project, it would not make much sense to use it from the get-go if I can get around it.
In the second attempt, I set the 'User-Agent' header, thinking that requests that appear to come from scripts (because no identifying information is present) might be ignored, but it made no difference.
Recommended answer
Google responds to the initial request with a series of cookie-setting 302 redirects. If you don't store and resubmit the cookies between requests, it redirects you to the login page.
So the problem is not the User-Agent header. It's that, by default, urllib.request.urlopen doesn't store cookies, but it does follow the HTTP 302 redirects.
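The redirect-plus-cookie behaviour can be reproduced locally without touching Google. The sketch below is a toy under stated assumptions: FakeSheetHandler, the /sheet and /csv paths, and the session=ok cookie are all invented for illustration and say nothing about Google's actual endpoints or cookie names. It only shows that an opener without a cookie jar follows the 302 and ends up on an HTML page, while an opener with HTTPCookieProcessor gets the CSV.

```python
import threading
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import HTTPCookieProcessor, ProxyHandler, build_opener


class FakeSheetHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a server that gates content behind a cookie."""

    def do_GET(self):
        if self.path == "/sheet":
            # First hop: set a cookie and redirect, as the answer describes.
            self.send_response(302)
            self.send_header("Set-Cookie", "session=ok; Path=/")
            self.send_header("Location", "/csv")
            self.send_header("Content-Length", "0")
            self.end_headers()
        elif "session=ok" in self.headers.get("Cookie", ""):
            self._reply(b"a,b\n1,2\n", "text/csv; charset=utf-8")
        else:
            # The cookie was not sent back: serve a fake "login page".
            self._reply(b"<html>please log in</html>", "text/html; charset=utf-8")

    def _reply(self, body, content_type):
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_both():
    """Return (body fetched without a cookie jar, body fetched with one)."""
    server = HTTPServer(("127.0.0.1", 0), FakeSheetHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/sheet" % server.server_address[1]
    no_proxy = ProxyHandler({})  # talk to the local server directly
    try:
        # Same cookie-less behaviour as urlopen: follows the 302, drops the cookie.
        plain = build_opener(no_proxy).open(url).read().decode("utf-8")
        with_jar = build_opener(no_proxy, HTTPCookieProcessor(CookieJar()))
        cooked = with_jar.open(url).read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()
    return plain, cooked
```

Calling fetch_both() should return the fake login HTML as the first element and the CSV text as the second, mirroring what the question observed with the real URL.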
The following code works just fine on a public spreadsheet available at the location specified by DOC_URL:
>>> from http.cookiejar import CookieJar
>>> from urllib.request import build_opener, HTTPCookieProcessor
>>> opener = build_opener(HTTPCookieProcessor(CookieJar()))
>>> resp = opener.open(DOC_URL)
>>> # should really parse resp.getheader('content-type') for encoding.
>>> csv_content = resp.read().decode('utf-8')
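The comment in the snippet above notes that the encoding should really come from the Content-Type header rather than being hard-coded. One possible way to do that with the standard library is sketched below. charset_from_content_type and decode_body are hypothetical helper names, and falling back to UTF-8 when the header names no charset is an assumption, not something the server guarantees.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset lowercased, or None if absent.
    return msg.get_content_charset() or default


def decode_body(raw, content_type):
    """Decode response bytes using the charset declared in Content-Type."""
    return raw.decode(charset_from_content_type(content_type))
```

With the opener above, this could be used as decode_body(resp.read(), resp.getheader('content-type', '')).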
Having shown you how to do it in vanilla Python, I'll now say that the Right Way to go about this is to use the most-excellent requests library. It is extremely well documented and makes these sorts of tasks incredibly pleasant to complete.
For instance, getting the same csv_content as above using the requests library is as simple as:
>>> import requests
>>> csv_content = requests.get(DOC_URL).text
That single line expresses your intent more clearly. It's easier to write and easier to read. Do yourself, and anyone else who shares your codebase, a favor and just use requests.
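Since the original goal was to save the CSV to disk, a minimal sketch of that last step follows. It assumes csv_content already holds the decoded text from either approach above. save_csv and its crude HTML check are hypothetical helpers, added because the failure mode in the question was getting an HTML page back instead of CSV.

```python
import csv
import io


def save_csv(csv_content, path):
    """Write csv_content to path and return its rows as lists of strings."""
    head = csv_content.lstrip().lower()
    if head.startswith(("<!doctype", "<html")):
        # Crude heuristic: this looks like an HTML (login) page, not CSV.
        raise ValueError("response looks like HTML, not CSV")
    # newline="" stops Python from rewriting the line endings in the data.
    with open(path, "w", encoding="utf-8", newline="") as fh:
        fh.write(csv_content)
    return list(csv.reader(io.StringIO(csv_content, newline="")))
```

For example, rows = save_csv(csv_content, "sheet.csv") would write the file and hand back the parsed rows; the file name here is only a placeholder.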