SEO Programming: How to Automatically Generate a Good Page Title?

新知榜 Official Account

2023-10-13 16:05:28

This article shows how to generate a good page Title automatically with Python: crawl the Baidu search results for a keyword, fetch the real title of every landing page, segment those titles and count word frequencies to find the high-frequency terms, and finally combine them into a natural-sounding sentence that becomes the page Title.

1. Hello, crawler!

First install the requests library (pip install requests), then use requests.get() to fetch the HTML of the Baidu search results page.

import requests

# rn=50 asks Baidu to return 50 results on one page
url = 'http://www.baidu.com/s?wd=月亮虾饼怎么做&rn=50'
response = requests.get(url)
print(response.content)
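
In practice Baidu often answers a bare requests.get() with a verification page rather than real results, so it usually helps to send a browser-like User-Agent. A minimal sketch; the header value below is only an example and is not part of the original script:

import requests

# Any common browser User-Agent string will do; this one is only an example
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
url = 'http://www.baidu.com/s?wd=月亮虾饼怎么做&rn=50'
response = requests.get(url, headers=headers)
print(response.status_code, len(response.content))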

2. Handling different encodings

Use the chardet library to quickly detect the page's character encoding, then decode the raw HTML bytes into a string with that encoding.

import chardet

htmlEncoded = response.content
detectResult = chardet.detect(htmlEncoded)
encoding = detectResult['encoding']
html = str(htmlEncoded, encoding)
print(html)
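
If you would rather avoid a separate chardet dependency, requests can also guess the encoding from the response body through its apparent_encoding property; a minimal alternative sketch:

# Let requests guess the encoding instead of calling chardet ourselves
response.encoding = response.apparent_encoding
html = response.text
print(html[:200])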

3. Getting landing-page information from the search results

Parse the HTML with BeautifulSoup and use a CSS selector to pull out every result link. These links point at a Baidu redirect address rather than the landing page itself, so each one is requested once more to find the landing page's real URL.

import requests
import chardet
from bs4 import BeautifulSoup

url = 'http://www.baidu.com/s?wd=月亮虾饼怎么做&rn=50'
response = requests.get(url)
htmlEncoded = response.content
detectResult = chardet.detect(htmlEncoded)
encoding = detectResult['encoding']
html = str(htmlEncoded, encoding)
soup = BeautifulSoup(html, 'html.parser')
items = soup.select('h3 a')
for item in items:
    resultRedirectUrl = item.attrs['href']
    if 'http://' in resultRedirectUrl or 'https://' in resultRedirectUrl:
        itemHeadRes = requests.head(resultRedirectUrl, verify=False)
        itemUrl = itemHeadRes.headers['Location']
        print(itemUrl)
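
requests.head() does not follow redirects on its own, which is why the Location response header carries the landing page's real address. If a result link happens not to redirect, headers['Location'] raises a KeyError; a slightly more defensive version of the two lines inside the loop could look like this:

# Explicitly stay on the redirect response and fall back to the original link
itemHeadRes = requests.head(resultRedirectUrl, allow_redirects=False, verify=False)
itemUrl = itemHeadRes.headers.get('Location', resultRedirectUrl)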

4. Decoding the Baidu landing-page URLs

Use requests.head() to resolve the real landing-page URL hidden behind each Baidu redirect link: the HEAD request is not followed to its target, so the Location header of the response holds the landing page's actual address.

import requests
import chardet
from bs4 import BeautifulSoup

url = 'http://www.baidu.com/s?wd=月亮虾饼怎么做&rn=50'
response = requests.get(url)
htmlEncoded = response.content
detectResult = chardet.detect(htmlEncoded)
encoding = detectResult['encoding']
html = str(htmlEncoded, encoding)
soup = BeautifulSoup(html, 'html.parser')
items = soup.select('h3 a')
for item in items:
    resultRedirectUrl = item.attrs['href']
    if 'http://' in resultRedirectUrl or 'https://' in resultRedirectUrl:
        itemHeadRes = requests.head(resultRedirectUrl, verify=False)
        itemUrl = itemHeadRes.headers['Location']
        print('Real landing-page URL:', itemUrl)

5. Fetching all landing-page titles

Request each landing page, parse its HTML with BeautifulSoup, and collect the text of its <title> tag.

import requests
import chardet
from bs4 import BeautifulSoup

url = 'http://www.baidu.com/s?wd=月亮虾饼怎么做&rn=50'
response = requests.get(url)
htmlEncoded = response.content
detectResult = chardet.detect(htmlEncoded)
encoding = detectResult['encoding']
html = str(htmlEncoded, encoding)
soup = BeautifulSoup(html, 'html.parser')
items = soup.select('h3 a')
allTitleStr = ''
for item in items:
    resultRedirectUrl = item.attrs['href']
    if 'http://' in resultRedirectUrl or 'https://' in resultRedirectUrl:
        itemHeadRes = requests.head(resultRedirectUrl, verify=False)
        itemUrl = itemHeadRes.headers['Location']
        itemRes = requests.get(itemUrl, verify=False)
        if itemRes.status_code == 200:
            itemHtmlEncoding = chardet.detect(itemRes.content)['encoding']
            itemHtml = str(itemRes.content, itemHtmlEncoding, errors='ignore')
            itemSoup = BeautifulSoup(itemHtml, 'html.parser')
            if itemSoup.title is not None:
                itemTitle = itemSoup.title.text.strip()
                print('Landing page Title:', itemTitle)
                allTitleStr += itemTitle

print(allTitleStr)
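
One detail worth noting: allTitleStr appends the titles back to back, so jieba may later merge the last characters of one title with the first characters of the next. Appending a separator inside the loop avoids that:

# Keep titles apart so word segmentation cannot run across title boundaries
allTitleStr += itemTitle + '\n'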

6. Extracting the words in all the titles

Use the jieba library to segment the combined title text into words; single-character results (mostly particles and punctuation) are filtered out, and the frequency count itself follows in the next step.

import requests
import chardet
from bs4 import BeautifulSoup
import jieba

url = 'http://www.baidu.com/s?wd=月亮虾饼怎么做&rn=50'
response = requests.get(url)
htmlEncoded = response.content
detectResult = chardet.detect(htmlEncoded)
encoding = detectResult['encoding']
html = str(htmlEncoded, encoding)
soup = BeautifulSoup(html, 'html.parser')
items = soup.select('h3 a')
allTitleStr = ''
for item in items:
    resultRedirectUrl = item.attrs['href']
    if 'http://' in resultRedirectUrl or 'https://' in resultRedirectUrl:
        itemHeadRes = requests.head(resultRedirectUrl, verify=False)
        itemUrl = itemHeadRes.headers['Location']
        itemRes = requests.get(itemUrl, verify=False)
        if itemRes.status_code == 200:
            itemHtmlEncoding = chardet.detect(itemRes.content)['encoding']
            itemHtml = str(itemRes.content, itemHtmlEncoding, errors='ignore')
            itemSoup = BeautifulSoup(itemHtml, 'html.parser')
            if itemSoup.title is not None:
                itemTitle = itemSoup.title.text.strip()
                print('Landing page Title:', itemTitle)
                allTitleStr += itemTitle

titleWords = [word for word in jieba.lcut(allTitleStr, cut_all=False) if len(word) > 1]
print(titleWords)
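
Landing-page titles usually also carry site boilerplate (brand names, words like "官网" or "首页") that would otherwise dominate the frequency count. A small, hand-maintained stopword set is an easy filter; the words below are only examples:

# Hypothetical stopword set; extend it with brand names and filler words you see in the output
stopwords = {'官网', '首页', '视频', '下载'}
titleWords = [word for word in titleWords if word not in stopwords]
print(titleWords)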

7. Finding the high-frequency words

Use collections.Counter to count how often each word occurs, then sort the resulting dictionary in descending order of frequency to surface the most common words.

import requests
import chardet
from bs4 import BeautifulSoup
import jieba
from collections import Counter

url = 'http://www.baidu.com/s?wd=月亮虾饼怎么做&rn=50'
response = requests.get(url)
htmlEncoded = response.content
detectResult = chardet.detect(htmlEncoded)
encoding = detectResult['encoding']
html = str(htmlEncoded, encoding)
soup = BeautifulSoup(html, 'html.parser')
items = soup.select('h3 a')
allTitleStr = ''
for item in items:
    resultRedirectUrl = item.attrs['href']
    if 'http://' in resultRedirectUrl or 'https://' in resultRedirectUrl:
        itemHeadRes = requests.head(resultRedirectUrl, verify=False)
        itemUrl = itemHeadRes.headers['Location']
        itemRes = requests.get(itemUrl, verify=False)
        if itemRes.status_code == 200:
            itemHtmlEncoding = chardet.detect(itemRes.content)['encoding']
            itemHtml = str(itemRes.content, itemHtmlEncoding, errors='ignore')
            itemSoup = BeautifulSoup(itemHtml, 'html.parser')
            if itemSoup.title is not None:
                itemTitle = itemSoup.title.text.strip()
                print('Landing page Title:', itemTitle)
                allTitleStr += itemTitle

titleWords = [word for word in jieba.lcut(allTitleStr, cut_all=False) if len(word) > 1]
titleWordsDic = dict(Counter(titleWords))
titleWordsSortedList = sorted(titleWordsDic.items(), key=lambda x: x[1], reverse=True)
for item in titleWordsSortedList:
    print(item[0], ':', item[1])
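
The same ranking is available directly from Counter through its most_common() method, so the dict conversion and manual sort can be collapsed into one call:

from collections import Counter

# most_common() returns (word, count) pairs sorted by count, highest first
for word, count in Counter(titleWords).most_common(20):
    print(word, ':', count)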

8. Assembling the final page Title

Combine the high-frequency words into the final page Title. The code below simply concatenates the five most frequent words; as noted in the introduction, the result still needs to be shaped into natural language by hand.

import requests
import chardet
from bs4 import BeautifulSoup
import jieba
from collections import Counter

url = 'http://www.baidu.com/s?wd=月亮虾饼怎么做&rn=50'
response = requests.get(url)
htmlEncoded = response.content
detectResult = chardet.detect(htmlEncoded)
encoding = detectResult['encoding']
html = str(htmlEncoded, encoding)
soup = BeautifulSoup(html, 'html.parser')
items = soup.select('h3 a')
allTitleStr = ''
for item in items:
    resultRedirectUrl = item.attrs['href']
    if 'http://' in resultRedirectUrl or 'https://' in resultRedirectUrl:
        # Resolve the Baidu redirect link to the landing page's real URL
        itemHeadRes = requests.head(resultRedirectUrl, verify=False)
        itemUrl = itemHeadRes.headers['Location']
        itemRes = requests.get(itemUrl, verify=False)
        if itemRes.status_code == 200:
            # Decode the landing page with its detected encoding and read its <title>
            itemHtmlEncoding = chardet.detect(itemRes.content)['encoding']
            itemHtml = str(itemRes.content, itemHtmlEncoding, errors='ignore')
            itemSoup = BeautifulSoup(itemHtml, 'html.parser')
            if itemSoup.title is not None:
                itemTitle = itemSoup.title.text.strip()
                print('Landing page Title:', itemTitle)
                allTitleStr += itemTitle

# Segment all titles, count word frequencies, and join the five most frequent words
titleWords = [word for word in jieba.lcut(allTitleStr, cut_all=False) if len(word) > 1]
titleWordsDic = dict(Counter(titleWords))
titleWordsSortedList = sorted(titleWordsDic.items(), key=lambda x: x[1], reverse=True)
title = "".join([item[0] for item in titleWordsSortedList[:5]])
print(title)
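
Simply gluing the top five words together rarely reads like natural language, which is why the summary above still calls for a human touch. A minimal sketch of a template-based alternative, where both the template and the keyword are assumptions you would adapt to your own page:

# Hypothetical template: the search keyword first, then the remaining high-frequency words
keyword = '月亮虾饼怎么做'
topWords = [word for word, count in titleWordsSortedList[:5] if word not in keyword]
title = keyword + ' - ' + '、'.join(topWords)
print(title)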
