新知榜官方账号
2023-10-13 16:05:28
This article shows how to generate a strong page Title automatically with Python: crawl the Baidu search results for a keyword, fetch the real title of every landing page, segment those titles and count word frequencies to find the most common terms, and finally combine the top terms into a single phrase that serves as the page Title.
First, install the requests library (for example with pip install requests; the later steps also use chardet, beautifulsoup4, and jieba, installable the same way), then fetch the HTML of the Baidu results page with requests.get().
import requests

url = 'http://www.baidu.com/s?wd=月亮虾饼怎么做&rn=50'  # rn=50: ask Baidu for 50 results per page
response = requests.get(url)
print(response.content)

The chardet library can quickly detect the page's character encoding, which lets us decode the raw HTML bytes into a string.
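As a quick standalone illustration of what chardet returns (the exact confidence value will vary):

import chardet

print(chardet.detect('月亮虾饼'.encode('utf-8')))
# e.g. {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}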
import chardet
htmlEncoded = response.content
detectResult = chardet.detect(htmlEncoded)
encoding = detectResult['encoding']
html = str(htmlEncoded, encoding)
print(html)

Parse the HTML with BeautifulSoup, use a CSS selector to extract every result link, and then request each link to obtain the landing page's real URL.
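To see how the 'h3 a' selector behaves, here is a toy fragment (illustrative HTML, not Baidu's actual markup):

from bs4 import BeautifulSoup

demo = '<h3 class="t"><a href="http://example.com/1">Result 1</a></h3>'
for a in BeautifulSoup(demo, 'html.parser').select('h3 a'):  # every <a> nested under an <h3>
    print(a.attrs['href'], a.text)  # http://example.com/1 Result 1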
import requests
import chardet
from bs4 import BeautifulSoup
url = 'http://www.baidu.com/s?wd=月亮虾饼怎么做&rn=50'
response = requests.get(url)
htmlEncoded = response.content
detectResult = chardet.detect(htmlEncoded)
encoding = detectResult['encoding']
html = str(htmlEncoded, encoding)
soup = BeautifulSoup(html, 'html.parser')
items = soup.select('h3 a')
for item in items:
    resultRedirectUrl = item.attrs['href']
    if 'http://' in resultRedirectUrl or 'https://' in resultRedirectUrl:
        itemHeadRes = requests.head(resultRedirectUrl, verify=False)
        itemUrl = itemHeadRes.headers['Location']
        print(itemUrl)

The head() method from the requests library is what resolves each encoded Baidu result link into the landing page's real URL.
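This works because Baidu's result links answer with an HTTP 302 redirect whose Location header holds the destination, and requests.head() does not follow redirects by default, so the header stays visible. A minimal sketch against a generic redirect service (httpbin.org is used here purely as an assumed public test endpoint):

import requests

res = requests.head('http://httpbin.org/redirect-to?url=http://example.com')
print(res.status_code)          # 302
print(res.headers['Location'])  # http://example.com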
import requests
import chardet
from bs4 import BeautifulSoup
url = 'http://www.baidu.com/s?wd=月亮虾饼怎么做&rn=50'
response = requests.get(url)
htmlEncoded = response.content
detectResult = chardet.detect(htmlEncoded)
encoding = detectResult['encoding']
html = str(htmlEncoded, encoding)
soup = BeautifulSoup(html, 'html.parser')
items = soup.select('h3 a')
for item in items:
    resultRedirectUrl = item.attrs['href']
    if 'http://' in resultRedirectUrl or 'https://' in resultRedirectUrl:
        itemHeadRes = requests.head(resultRedirectUrl, verify=False)
        itemUrl = itemHeadRes.headers['Location']
        print('Real result URL:', itemUrl)

Next, fetch each landing page, parse its HTML with BeautifulSoup, and extract all the landing-page titles.
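BeautifulSoup exposes the <title> element directly as soup.title; a one-line toy example:

from bs4 import BeautifulSoup

print(BeautifulSoup('<title> 月亮虾饼的做法 </title>', 'html.parser').title.text.strip())
# 月亮虾饼的做法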
import requests
import chardet
from bs4 import BeautifulSoup
url = 'http://www.baidu.com/s?wd=月亮虾饼怎么做&rn=50'
response = requests.get(url)
htmlEncoded = response.content
detectResult = chardet.detect(htmlEncoded)
encoding = detectResult['encoding']
html = str(htmlEncoded, encoding)
soup = BeautifulSoup(html, 'html.parser')
items = soup.select('h3 a')
allTitleStr = ''
for item in items:
    resultRedirectUrl = item.attrs['href']
    if 'http://' in resultRedirectUrl or 'https://' in resultRedirectUrl:
        itemHeadRes = requests.head(resultRedirectUrl, verify=False)
        itemUrl = itemHeadRes.headers['Location']
        itemRes = requests.get(itemUrl, verify=False)
        if itemRes.status_code == 200:
            # Landing pages use assorted encodings, so detect per page;
            # fall back to utf-8 if detection returns None.
            itemHtmlEncoding = chardet.detect(itemRes.content)['encoding'] or 'utf-8'
            itemHtml = str(itemRes.content, itemHtmlEncoding, errors='ignore')
            itemSoup = BeautifulSoup(itemHtml, 'html.parser')
            if itemSoup.title is not None:
                itemTitle = itemSoup.title.text.strip()
                print('Landing page Title:', itemTitle)
                allTitleStr += itemTitle
print(allTitleStr)

Next, use the jieba library to segment the collected titles into individual words, in preparation for counting word frequencies.
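A standalone look at what jieba.lcut() produces (the exact segmentation depends on jieba's dictionary version):

import jieba

print(jieba.lcut('月亮虾饼怎么做才好吃', cut_all=False))
# e.g. ['月亮', '虾饼', '怎么', '做', '才', '好吃']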
import requests
import chardet
from bs4 import BeautifulSoup
import jieba
url = 'http://www.baidu.com/s?wd=月亮虾饼怎么做&rn=50'
response = requests.get(url)
htmlEncoded = response.content
detectResult = chardet.detect(htmlEncoded)
encoding = detectResult['encoding']
html = str(htmlEncoded, encoding)
soup = BeautifulSoup(html, 'html.parser')
items = soup.select('h3 a')
allTitleStr = ''
for item in items:
    resultRedirectUrl = item.attrs['href']
    if 'http://' in resultRedirectUrl or 'https://' in resultRedirectUrl:
        itemHeadRes = requests.head(resultRedirectUrl, verify=False)
        itemUrl = itemHeadRes.headers['Location']
        itemRes = requests.get(itemUrl, verify=False)
        if itemRes.status_code == 200:
            itemHtmlEncoding = chardet.detect(itemRes.content)['encoding'] or 'utf-8'  # fall back if chardet fails
            itemHtml = str(itemRes.content, itemHtmlEncoding, errors='ignore')
            itemSoup = BeautifulSoup(itemHtml, 'html.parser')
            if itemSoup.title is not None:
                itemTitle = itemSoup.title.text.strip()
                print('Landing page Title:', itemTitle)
                allTitleStr += itemTitle
titleWords = [word for word in jieba.lcut(allTitleStr, cut_all=False) if len(word) > 1]  # drop single-character tokens
print(titleWords)

Use collections.Counter to tally how often each word occurs, then sort the resulting dictionary in descending order in plain Python to surface the highest-frequency words.
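Incidentally, Counter can deliver the descending order itself through most_common(), which the manual sorted() call below reproduces; a toy demonstration:

from collections import Counter

words = ['虾饼', '月亮', '虾饼', '做法', '虾饼', '月亮']
for word, count in Counter(words).most_common(3):
    print(word, ':', count)  # 虾饼 : 3, then 月亮 : 2, then 做法 : 1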
import requests
import chardet
from bs4 import BeautifulSoup
import jieba
from collections import Counter
url = 'http://www.baidu.com/s?wd=月亮虾饼怎么做&rn=50'
response = requests.get(url)
htmlEncoded = response.content
detectResult = chardet.detect(htmlEncoded)
encoding = detectResult['encoding']
html = str(htmlEncoded, encoding)
soup = BeautifulSoup(html, 'html.parser')
items = soup.select('h3 a')
allTitleStr = ''
for item in items:
    resultRedirectUrl = item.attrs['href']
    if 'http://' in resultRedirectUrl or 'https://' in resultRedirectUrl:
        itemHeadRes = requests.head(resultRedirectUrl, verify=False)
        itemUrl = itemHeadRes.headers['Location']
        itemRes = requests.get(itemUrl, verify=False)
        if itemRes.status_code == 200:
            itemHtmlEncoding = chardet.detect(itemRes.content)['encoding'] or 'utf-8'  # fall back if chardet fails
            itemHtml = str(itemRes.content, itemHtmlEncoding, errors='ignore')
            itemSoup = BeautifulSoup(itemHtml, 'html.parser')
            if itemSoup.title is not None:
                itemTitle = itemSoup.title.text.strip()
                print('Landing page Title:', itemTitle)
                allTitleStr += itemTitle
titleWords = [word for word in jieba.lcut(allTitleStr, cut_all=False) if len(word) > 1]
titleWordsDic = dict(Counter(titleWords))
titleWordsSortedList = sorted(titleWordsDic.items(), key=lambda x: x[1], reverse=True)
for item in titleWordsSortedList:
    print(item[0], ':', item[1])

Finally, join the resulting high-frequency words into a single phrase to serve as the final page Title.
import requests
import chardet
from bs4 import BeautifulSoup
import jieba
from collections import Counter
url = 'http://www.baidu.com/s?wd=月亮虾饼怎么做&rn=50'
response = requests.get(url)
htmlEncoded = response.content
detectResult = chardet.detect(htmlEncoded)
encoding = detectResult['encoding']
html = str(htmlEncoded, encoding)
soup = BeautifulSoup(html, 'html.parser')
items = soup.select('h3 a')
allTitleStr = ''
for item in items:
    resultRedirectUrl = item.attrs['href']
    if 'http://' in resultRedirectUrl or 'https://' in resultRedirectUrl:
        itemHeadRes = requests.head(resultRedirectUrl, verify=False)
        itemUrl = itemHeadRes.headers['Location']
        itemRes = requests.get(itemUrl, verify=False)
        if itemRes.status_code == 200:
            itemHtmlEncoding = chardet.detect(itemRes.content)['encoding'] or 'utf-8'  # fall back if chardet fails
            itemHtml = str(itemRes.content, itemHtmlEncoding, errors='ignore')
            itemSoup = BeautifulSoup(itemHtml, 'html.parser')
            if itemSoup.title is not None:
                itemTitle = itemSoup.title.text.strip()
                print('Landing page Title:', itemTitle)
                allTitleStr += itemTitle
titleWords = [word for word in jieba.lcut(allTitleStr, cut_all=False) if len(word) > 1]
titleWordsDic = dict(Counter(titleWords))
titleWordsSortedList = sorted(titleWordsDic.items(), key=lambda x: x[1], reverse=True)
title = "".join([item[0] for item in titleWordsSortedList[:5]])
print(title)
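Simply gluing the five most frequent words together rarely reads like natural language, so a final manual pass is still advisable. One possible post-processing sketch (the separator and the leading keyword are illustrative assumptions, not part of the original script):

# Assumes titleWordsSortedList from the script above is in scope.
keyword = '月亮虾饼怎么做'  # the original search term
topWords = [word for word, count in titleWordsSortedList[:5]]
print(keyword + '_' + '_'.join(topWords))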