Learning Scrapy: Working with URLs
Handling the various kinds of URLs encountered while crawling
There are three main ways to turn a relative URL into an absolute one:
Using the urljoin function from urllib.parse:
from urllib.parse import urljoin
# Same as: from w3lib.url import urljoin
url = urljoin(base_url, relative_url)
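To make the resolution rules concrete, here is a small sketch that uses only the standard library. The base URL and the relative links are made-up values for illustration, not taken from any real site.

```python
from urllib.parse import urljoin

# Hypothetical page URL, used only for illustration.
base_url = "https://example.com/a/b/page.html"

# Typical kinds of links found in an href attribute.
relative_urls = [
    "item.html",               # relative to the current directory
    "/item.html",              # relative to the site root
    "../item.html",            # one directory up
    "?page=2",                 # query string only
    "//cdn.example.com/x.js",  # scheme-relative
    "https://other.org/x",     # already absolute, returned unchanged
]

# Resolve every link against the base URL.
resolved = {rel: urljoin(base_url, rel) for rel in relative_urls}

for rel, absolute in resolved.items():
    print(f"{rel!r:28} -> {absolute}")
```

Note that a link that is already absolute passes through untouched, so it is safe to call urljoin on every extracted href without checking first.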
Using the response's urljoin wrapper method, as mentioned by Steve, which takes the response's own URL as the base:
url = response.urljoin(relative_url)
If you also want to yield a request from that link, you can use the response's handy follow method:
# It will create a new request using the above "urljoin" method
yield response.follow(relative_url, callback=self.parse)