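I am scraping a website with curl (via PHP). Some of the information I want is a product list, of which only the first few products are shown by default. The rest is delivered when the user clicks a button to get the full list, which triggers an AJAX call that returns it.
In short, this is the JS they use: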
    headers['__RequestVerificationToken'] = token;
    $.ajax({
        type: "post",
        url: "/ajax/getProductList",
        dataType: 'html',
        data: JSON.stringify({
            historyPageIndex: 1,
            displayPeriod: 0,
            productsType: All
        }),
        contentType: 'application/json; charset=utf-8',
        success: function (result) {
            $(target).html("");
            $(target).html(result);
        },
        beforeSend: function (XMLHttpRequest) {
            if (headers['__RequestVerificationToken']) {
                XMLHttpRequest.setRequestHeader("__RequestVerificationToken", headers['__RequestVerificationToken']);
            }
        }
    });
And this is my PHP script:
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieLocation);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieLocation);
    curl_setopt($ch, CURLOPT_POST, false);
    curl_setopt($ch, CURLOPT_URL, 'https://www.domain.com/Applications/ViewProducts');
    curl_setopt($ch, CURLOPT_REFERER, 'https://www.domain.com/');
    $webpage = curl_exec($ch);

    $productsType = trim(find_by_pattren($webpage, '<input id="productsType" name="productsType" type="hidden" value="(.*?)"'));
    $token = trim(find_by_pattren($webpage, '<input name="__RequestVerificationToken" type="hidden" value="(.*?)"'));
    $postVariables = 'productsType='.$productsType. '&historyPageIndex=1 &displayPeriod=0 &__RequestVerificationToken='.$token;

    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $postVariables);
    curl_setopt($ch, CURLOPT_URL, 'https://www.domain.com/ajax/getProductList');
    curl_setopt($ch, CURLOPT_REFERER, 'https://www.domain.com/Applications/ViewProducts');
    $webpage = curl_exec($ch);
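This produces an error page on the site. I think the main causes could be:
- They check whether the request is an AJAX request (I don't know how to get around that)
- The token has to be sent in a header rather than in the POST variables
Any ideas?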
Edit: here is the working code:
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieLocation);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieLocation);
    curl_setopt($ch, CURLOPT_URL, 'https://www.domain.com/Applications/ViewProducts');
    curl_setopt($ch, CURLOPT_REFERER, 'https://www.domain.com/');
    $webpage = curl_exec($ch);

    $productsType = trim(find_by_pattren($webpage, '<input id="productsType" name="productsType" type="hidden" value="(.*?)"'));
    $token = trim(find_by_pattren($webpage, '<input name="__RequestVerificationToken" type="hidden" value="(.*?)"'));
    $postVariables = json_encode(array('productsType' => $productsType, 'historyPageIndex' => 1, 'displayPeriod' => 0));

    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array(
        "X-Requested-With: XMLHttpRequest",
        "Content-Type: application/json; charset=utf-8",
        "__RequestVerificationToken: $token"
    ));
    curl_setopt($ch, CURLOPT_POSTFIELDS, $postVariables);
    curl_setopt($ch, CURLOPT_URL, 'https://www.domain.com/ajax/getProductList');
    curl_setopt($ch, CURLOPT_REFERER, 'https://www.domain.com/Applications/ViewProducts');
    $webpage = curl_exec($ch);
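Both scripts call find_by_pattren(), whose body is not shown. For anyone trying to reproduce this, here is a minimal sketch of what such a helper might look like, assuming it simply wraps preg_match and returns the first capture group. The implementation and the sample HTML are assumptions for illustration, not the original code.

```php
<?php
// Hypothetical stand-in for the find_by_pattren() helper used in the post;
// the real implementation is not shown, so this is only an assumption.
function find_by_pattren(string $html, string $pattern): string
{
    // '~' is used as the delimiter so slashes in the pattern need no escaping;
    // the 's' modifier lets '.' match across line breaks.
    if (preg_match('~' . $pattern . '~s', $html, $matches) === 1) {
        // Return the first capture group, or '' if the pattern has none.
        return $matches[1] ?? '';
    }
    return '';
}

// Made-up sample markup standing in for the fetched page.
$sample = '<form><input name="__RequestVerificationToken" type="hidden" value="abc123" /></form>';
$token = trim(find_by_pattren($sample, '<input name="__RequestVerificationToken" type="hidden" value="(.*?)"'));
echo $token, PHP_EOL;
```

If the helper returns an empty string, the pattern did not match the page, which is worth checking before sending the second request.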
To send the request verification token as a header, mimic an AJAX request more closely, and set the content type to JSON, use CURLOPT_HTTPHEADER:
    curl_setopt($ch, CURLOPT_HTTPHEADER, array(
        "X-Requested-With: XMLHttpRequest",
        "Content-Type: application/json; charset=utf-8",
        "__RequestVerificationToken: $token"
    ));
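If it helps, the header list can be built in a small function so it is easy to inspect before handing it to curl. This is only a sketch: build_ajax_headers() is a hypothetical name, 'abc123' is a placeholder token, and the line-break check is an extra precaution that the original code does not have.

```php
<?php
// Hypothetical helper (not from the original post): builds the header list
// that is passed to CURLOPT_HTTPHEADER for the AJAX-style POST.
function build_ajax_headers(string $token): array
{
    // Reject CR/LF so a malformed token cannot inject extra header lines.
    if (preg_match('/[\r\n]/', $token) === 1) {
        throw new InvalidArgumentException('Token must not contain line breaks.');
    }
    return [
        'X-Requested-With: XMLHttpRequest',
        'Content-Type: application/json; charset=utf-8',
        '__RequestVerificationToken: ' . $token,
    ];
}

$headers = build_ajax_headers('abc123'); // placeholder token for illustration
// With a real handle: curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
print_r($headers);
```

jQuery adds X-Requested-With: XMLHttpRequest to same-origin AJAX requests by default, which is why sending it here makes the request look more like the browser's.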
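I also noticed that you redundantly set CURLOPT_POST to false on line 7 of your code, and that the POST data you are sending is not in JSON format. You should have: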
$postVariables = '{"historyPageIndex":1,"displayPeriod":0,"productsType":"All"}';
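Rather than hand-writing the JSON string, the same body can be produced with json_encode, which takes care of quoting and escaping if productsType is scraped from the page instead of hard-coded. A minimal sketch, where build_product_list_payload() is a hypothetical name and the key order simply follows the site's JS:

```php
<?php
// Hypothetical helper: builds the JSON request body with the same keys
// (and key order) as the JSON.stringify call in the site's JS.
function build_product_list_payload(string $productsType, int $pageIndex = 1, int $displayPeriod = 0): string
{
    $json = json_encode([
        'historyPageIndex' => $pageIndex,
        'displayPeriod'    => $displayPeriod,
        'productsType'     => $productsType,
    ]);
    // json_encode returns false on failure (for example, invalid UTF-8 input).
    if ($json === false) {
        throw new RuntimeException('JSON encoding failed: ' . json_last_error_msg());
    }
    return $json;
}

$postVariables = build_product_list_payload('All');
echo $postVariables, PHP_EOL;
// prints {"historyPageIndex":1,"displayPeriod":0,"productsType":"All"}
```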