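Python: How to download an Excel file from a page
- Go to this URL: https://www.horseracebase.com/horse-racing-results.php?year=2005&month=3&day=15 (username = TrickyBen | password = TrickyBen123)
- Note that there is a red "Excel" download button
- I want to download the Excel file and turn it into a pandas DataFrame. I want to do this programmatically (i.e. from a script, not by manually clicking on the website). How would I do that?
This code logs you in as TrickyBen and makes a request to the website's API…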
import requests
from lxml import html
from requests import Session
import pandas as pd
import shutil

raceSession = Session()
LoginDetails = {'login': 'TrickyBen', 'password': 'TrickyBen123'}
LoginUrl = 'https://www.horseracebase.com/horse-racing-results.php?year=2005&month=3&day=15/horsebase1.php'
LoginPost = raceSession.post(LoginUrl, data=LoginDetails)
RaceUrl = 'https://www.horseracebase.com/excelresults.php'
RaceDataDetails = {"user": "41495", "racedate": "2005-3-15", "downloadbutton": "Excel"}
PostHeaders = {"Content-Type": "application/x-www-form-urlencoded"}
Response = raceSession.post(RaceUrl, data=RaceDataDetails, headers=PostHeaders)
Table = pd.read_table(Response.text)
Table.to_csv('blahblah.csv')
If you inspect the element, you'll notice that the relevant markup looks like this…
<form action="excelresults.php" method="post">
  <input type="hidden" name="user" value="41495">
  <input type="hidden" name="racedate" value="2005-3-15">
  <input type="submit" class="downloadbutton" value="Excel">
</form>
I get this error message…
Traceback (most recent call last):
  File "/Users/Alex/Desktop/DateTest/hrpull.py", line 20, in <module>
    Table = pd.read_table(Response.text)
  File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 315, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 645, in __init__
    self._make_engine(self.engine)
  File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 799, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 1213, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 358, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3427)
  File "pandas/parser.pyx", line 628, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:6861)
IOError: File race_date race_time track race_name race_restrictions_age race_class major race_distance prize_money going_description number_of_runners place distbt horse_name stall trainer horse_age jockey_name jockeys_claim pounds odds fav official_rating comptime TotalDstBt MedianOR Dist_Furlongs placing_numerical RCode BFSP BFSP_Place PlcsPaid BFPlcsPaid Yards RailMove RaceType
"2005-03-15" "14:00:00" "Cheltenham" "Letheby & Christopher Supreme Novices Hurdle " "4yo+" "Class 1" "Grade 1" "2m˝f " "58000" "Good" "20" "1st" "Arcalis" "0" "Johnson, J Howard" "5" "Lee, G" "0" "161" "21" "136" "3 mins 53.00s" "121.5" "16.5" "1" "National Hunt" "0" "0" "3" "0" "0" "0" "Novices Hurdle"
"2005-03-15" "14:00:00" "Cheltenham" "Letheby & Christopher Supreme Novices Hurdle " "4yo+" "Class 1" "Grade 1" "2m˝f " "58000" "Good" "20" "2nd" "6" "Wild Passion (GER)" "0" "Meade, Noel" "5" "Carberry, P" "0" "161" "11" "0" "3 mins 53.00s" "6" "121.5" "16.5" "2" "National Hunt" "0" "0" "3" "0" "0" "0" "Novices Hurdle"
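I think you may be able to see the data you want to download on another web page, for example by clicking "My Systems (v4)". If you can, then you can use urllib.request.urlretrieve to download that page. You can then parse the data with html.parser.HTMLParser and do with it as you wish.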
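A minimal sketch of the parsing half of that idea, assuming the data sits in an ordinary HTML table. The class name TableExtractor and the sample markup are made up for illustration; in a real script the HTML would come from the downloaded page (for example a file saved with urllib.request.urlretrieve), and the site's actual markup may well differ.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being built, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical sample markup, not copied from the real site.
sample_html = """
<table>
  <tr><th>track</th><th>horse_name</th></tr>
  <tr><td>Cheltenham</td><td>Arcalis</td></tr>
</table>
"""

parser = TableExtractor()
parser.feed(sample_html)
print(parser.rows)
```

The resulting list of rows can be handed to pandas.DataFrame, using the first row as the column names.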
If you look at the API called in the form's action, you'll see that you need to make a POST request to this URL:
https://www.horseracebase.com/excelresults.php
with the following parameters:
data = {
    "user": "41495",  # looks like this varies with login, so update it if you change your login id
    "racedate": "2005-3-15",
    "downloadbutton": "Excel"
}
You can do something like this:
response = raceSession.post(reqUrl, data=data)  # data= sends the fields form-encoded; json= would send a JSON body instead
If that doesn't work, try adding headers to the request, e.g. headers=postHeaders. In this case you should set the Content-Type header, because you are sending form-encoded data, so:
headers = {"Content-Type": "application/x-www-form-urlencoded"}
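To see why the encoding matters, here is a small standard-library sketch of the two body formats. It does not contact the site; it only illustrates, roughly, what a form-encoded body looks like (the kind an HTML form submits, and the kind requests builds from a dict passed as data=) compared with a JSON body (what json= sends). The variable names form_body and json_body are mine.

```python
import json
from urllib.parse import urlencode

# Same parameters as in the form above.
data = {"user": "41495", "racedate": "2005-3-15", "downloadbutton": "Excel"}

# Form encoding: key=value pairs joined with "&".
form_body = urlencode(data)

# JSON encoding: a different format, which a form handler is unlikely to parse.
json_body = json.dumps(data)

print(form_body)
print(json_body)
```

A server-side script that expects form fields will typically not find them in a JSON body, which is why the Content-Type and the body encoding need to agree.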
Read this for more information on how to save the Excel response to a file.
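As a rough sketch of saving a download to disk: the helper below simply writes raw bytes to a file. The function name save_download, the file name results.xls, and the stand-in bytes are assumptions for illustration; in a real script the bytes would be Response.content from the requests call.

```python
import os
import tempfile


def save_download(content, path):
    """Write the raw bytes of a download to `path` and return the path."""
    with open(path, "wb") as fh:
        fh.write(content)
    return path


# Stand-in bytes; in a real script this would be Response.content.
fake_content = b"race_date\ttrack\n2005-03-15\tCheltenham\n"

out_path = save_download(
    fake_content, os.path.join(tempfile.mkdtemp(), "results.xls")
)
print(out_path)
```

Opening the file in binary mode ("wb") avoids any text re-encoding of the downloaded data.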
I tried this request in Postman, and it looks like you don't need any additional headers other than Content-Type.
Edit
Here is what you need to do:
import pandas as pd
from requests import Session
# from StringIO import StringIO  # for Python 2.x
from io import StringIO  # for Python 3.x

raceSession = Session()
RaceUrl = 'https://www.horseracebase.com/excelresults.php'
RaceDataDetails = {"user": "41495", "racedate": "2005-3-15", "downloadbutton": "Excel"}
PostHeaders = {"Content-Type": "application/x-www-form-urlencoded"}
Response = raceSession.post(RaceUrl, data=RaceDataDetails, headers=PostHeaders)
# read_table treats a plain string as a file path, so wrap the text in a file-like object
Table = pd.read_table(StringIO(Response.text))
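The same fix can be tried offline. This is only a sketch: the sample text below is a cut-down, hand-written stand-in for the response (three of the columns from the traceback above), and it assumes the real response is tab-separated, which the traceback suggests but does not prove.

```python
from io import StringIO

import pandas as pd

# Hand-written stand-in for Response.text (assumed tab-separated).
sample_text = (
    'race_date\trace_time\ttrack\n'
    '"2005-03-15"\t"14:00:00"\t"Cheltenham"\n'
)

# Wrapping the string in StringIO makes pandas parse its contents
# instead of treating the string as a file name.
table = pd.read_table(StringIO(sample_text))

print(table.columns.tolist())
print(table.loc[0, "track"])
```

Passing the string itself, as in the question, makes pandas look for a file with that name, which is what produced the IOError.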