Scraping Meme Images with Python and Baidu AI

Views: 129 | Date: 2022-06-14 15:30:11

Contents
1. Applying for a Baidu AI Open Platform key
2. Scraping meme images from Tieba
3. Using Baidu-aip

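This article first scrapes meme images from the web, then uses Baidu AI to recognize the caption text on each image and renames the file after that text. When you want to send a meme, you no longer have to open the files one by one; you can pick the right one by its file name.

1. Applying for a Baidu AI Open Platform key

This example uses the Baidu AI API for text recognition, so you first need to apply for access to the API. The steps are as follows:

In a web browser (such as Chrome or Firefox), enter ai.baidu.com in the address bar to open the Baidu Cloud AI site, then click the Console button in the upper-right corner.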

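On the Baidu Cloud AI login page, enter your Baidu account and password. If you do not have an account, click the Register Now link to create one.

After logging in you arrive at the console page. Click Products and Services in the left navigation to expand the list; the Artificial Intelligence category appears at the lower right of the list. Choose Image Recognition there, or choose Text Recognition directly.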

This opens the Image Recognition overview page. To use the Baidu Cloud AI API you need to apply for permission, and before that you need to create your own application, so click the Create Application button.

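On the Create Application page, enter the application name, choose the application type, and select the interfaces. Note that you can select more interfaces than you need right now, including any you may use later, so that they are already available when you build other examples. After selecting the interfaces, set the text recognition package name to Not Required, enter a description of the application, and click the Create Now button.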

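Once the application is created, click the Back to Application List button. The list page shows the application together with the AppID, API Key and Secret Key that Baidu Cloud assigned to it automatically. These values differ for every application, so store them safely for use during development.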
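Since these three values are secrets tied to one application, one option is to keep them out of the source file. The following is a minimal sketch that reads them from environment variables. The variable names BAIDU_APP_ID, BAIDU_API_KEY and BAIDU_SECRET_KEY and the helper load_baidu_config are assumptions made for illustration; they are not defined by the Baidu console or by the SDK.

```python
import os


def load_baidu_config(env=None):
    """Build the keyword arguments for AipOcr from environment variables.

    The variable names below are hypothetical; use whatever fits your setup.
    """
    env = os.environ if env is None else env
    names = {
        'appId': 'BAIDU_APP_ID',
        'apiKey': 'BAIDU_API_KEY',
        'secretKey': 'BAIDU_SECRET_KEY',
    }
    # Fail early with a readable message if any value is missing or empty.
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise RuntimeError('missing environment variables: ' + ', '.join(missing))
    return {key: env[var] for key, var in names.items()}
```

The returned dictionary has the same three keys as the config dictionary used in section 3, so it could be passed to the client as AipOcr(**load_baidu_config()).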

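2. Scraping meme images from Tieba

This example uses a Baidu Tieba thread that contains some home-made memes: https://tieba.baidu.com/p/5522091060. The goal is to download all of its images. The steps are as follows:

Use the Network tab of the browser's developer tools to check whether the returned data matches what the Elements panel shows, that is, whether the response already contains the data we want instead of loading it through JavaScript tricks. Copy the link of the first image and search for it in the Response pane of the Network tab.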

On the first page, the Network capture shows no sign of data being loaded dynamically through Ajax.

Clicking through to the second page, however, does reveal an Ajax request.

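Searching that response for the URL of the first image finds it there as well.

The request carries three parameters, and pn is presumably page_number, the page index. Replay the request with Postman or with your own code, remembering to set the Host and X-Requested-With headers, and check whether pn=1 returns the first page. It does, which means the data of every page can be fetched through this interface.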
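As a rough illustration of that check, the sketch below only assembles the URL and headers for one page, so it can be inspected or pasted into Postman without sending anything. The helper build_page_request and its now argument are hypothetical names introduced here; the parameter names pn, ajax and t and the two headers are the ones used by the scraper in this article.

```python
import time
from urllib.parse import urlencode

THREAD_URL = 'https://tieba.baidu.com/p/5522091060'


def build_page_request(page, now=None):
    """Return (url, headers) for one page of the thread without sending it."""
    params = {
        'pn': page,    # guessed to be the page number
        'ajax': '1',   # marks the request as an Ajax call
        't': int(time.time() if now is None else now),  # timestamp in seconds
    }
    headers = {
        'Host': 'tieba.baidu.com',
        'X-Requested-With': 'XMLHttpRequest',
    }
    return THREAD_URL + '?' + urlencode(params), headers


url, headers = build_page_request(1, now=0)
print(url)
print(headers)
```

Passing a fixed now value makes the generated URL reproducible, which is convenient when comparing it with the request captured in the Network tab.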

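First load the thread to find out the number of the last page, then loop over all pages, parse each one to get the image URLs, and write them to a file. The images are then downloaded with multiple threads. The full code is as follows.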

# Download every image from one Baidu Tieba thread
import os
import queue
import threading
import time

import chardet
import requests
from bs4 import BeautifulSoup

tiezi_url = 'https://tieba.baidu.com/p/5522091060'
headers = {
    'Host': 'tieba.baidu.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
}
pic_save_dir = 'tiezi_pic/'
os.makedirs(pic_save_dir, exist_ok=True)  # create the folder if it does not exist
pic_urls_file = 'tiezi_pic_urls.txt'
download_q = queue.Queue()  # download queue


# Get the number of pages in the thread
def get_page_count():
    try:
        resp = requests.get(tiezi_url, headers=headers, timeout=5)
        resp.encoding = chardet.detect(resp.content)['encoding']
        soup = BeautifulSoup(resp.text, 'lxml')
        a_s = soup.find('ul', attrs={'class': 'l_posts_num'}).find_all('a')
        for a in a_s:
            if a.get_text() == '尾页':  # link text of the "last page" button
                return a['href'].split('=')[1]
    except Exception as e:
        print(str(e))


# Download thread
class PicSpider(threading.Thread):
    def __init__(self, t_name, func):
        self.func = func
        threading.Thread.__init__(self, name=t_name)

    def run(self):
        self.func()


# Collect all image URLs on one page and append them to the URL file
def get_pics(count):
    params = {'pn': count, 'ajax': '1', 't': int(time.time())}
    try:
        resp = requests.get(tiezi_url, headers=headers, timeout=5, params=params)
        resp.encoding = chardet.detect(resp.content)['encoding']
        soup = BeautifulSoup(resp.text, 'lxml')
        imgs = soup.find_all('img', attrs={'class': 'BDE_Image'})
        with open(pic_urls_file, 'a') as fout:
            for img in imgs:
                print(img['src'])
                fout.write(img['src'] + '\n')
    except Exception as e:
        print(e)


# Worker function run by each download thread
def down_pics():
    while True:
        try:
            # get_nowait avoids blocking forever if another thread
            # took the last item after an empty() check
            data = download_q.get_nowait()
        except queue.Empty:
            break
        download_pic(data)
        download_q.task_done()


# Download a single image
def download_pic(img_url):
    try:
        resp = requests.get(img_url, headers=headers, timeout=10)
        if resp.status_code == 200:
            print('Downloading image: ' + img_url)
            pic_name = img_url.split('/')[-1]
            with open(pic_save_dir + pic_name, 'wb') as f:
                f.write(resp.content)
    except Exception as e:
        print(e)


if __name__ == '__main__':
    print('Checking whether the URL file exists:')
    if not os.path.exists(pic_urls_file):
        print('Not found, parsing the thread...')
        page_count = get_page_count()
        if page_count is not None:
            headers['X-Requested-With'] = 'XMLHttpRequest'
            for page in range(1, int(page_count) + 1):
                get_pics(page)
            print('All links parsed!')
            headers.pop('X-Requested-With')
    else:
        print('Found')
    if not os.path.exists(pic_urls_file):
        raise SystemExit('No image URLs were collected.')
    print('Starting image download~~~~')
    headers['Host'] = 'imgsa.baidu.com'
    with open(pic_urls_file, 'r') as fo:
        # strip the trailing newline from every stored URL
        pic_list = [line.strip() for line in fo if line.strip()]
    threads = []
    for pic in pic_list:
        download_q.put(pic)
    for i in range(0, len(pic_list)):
        t = PicSpider(t_name='thread-' + str(i), func=down_pics)
        t.daemon = True
        t.start()
        threads.append(t)
    download_q.join()
    for t in threads:
        t.join()
    print('All images downloaded')

Running the script prints each image URL as it is parsed and saves the downloaded images in the tiezi_pic/ directory.
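The script above starts one thread per image URL. A possible variation, sketched below under the assumption that a fixed number of workers is enough, uses the standard library's ThreadPoolExecutor instead. The names download_all, fetch and max_workers are hypothetical, and the default of 8 workers is an arbitrary choice rather than a tuned value.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, fetch, max_workers=8):
    """Run fetch(url) for every URL on a bounded pool of worker threads.

    Returns a dict mapping each URL to either the value fetch returned or
    the exception it raised, so one failed download does not stop the rest.
    """
    urls = list(urls)

    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception as exc:
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the results in the same order as the input URLs
        results = list(pool.map(safe_fetch, urls))
    return dict(zip(urls, results))
```

With the scraper's own names, this could be called as download_all(pic_list, download_pic) in place of the queue and the PicSpider threads.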

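Next, OCR is used to extract the text from each meme and name the image after it, so that a file search for a keyword quickly finds the meme you need. Tesseract, the open-source OCR engine associated with Google, is not well suited to this kind of large image with small text: its recognition rate is too low, and sometimes it recognizes nothing at all. Baidu Cloud OCR is a better fit here, because it locates the text regions in the image automatically and returns all the text it finds.

3. Using Baidu-aip

After obtaining the application keys from Baidu AI, install Baidu-aip on your local system with the following command: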

pip install baidu-aip

First recognize a single image to see how well it works:

from aip import AipOcr

# Create an AipOcr client
config = {
    'appId': 'your appId',
    'apiKey': 'your apiKey',
    'secretKey': 'your secretKey',
}
client = AipOcr(**config)


# Recognize the text in an image
def img_to_str(image_path):
    # Read the image as bytes
    with open(image_path, 'rb') as fp:
        image = fp.read()
    # Call general text recognition with the local image
    result = client.basicGeneral(image)
    # Join the recognized lines, or return an empty string if there are none
    if 'words_result' in result:
        return '\n'.join([w['words'] for w in result['words_result']])
    return ''


if __name__ == '__main__':
    print(img_to_str('tiezi_pic/5c0ddb1e4134970aebd593e29ecad1c8a5865dbd.jpg'))

Run the program to see the recognition result.

Baidu AI returns JSON data, which the client exposes as a dictionary, as shown below. The dictionary has three keys: log_id, words_result_num and words_result. words_result_num is the number of recognized text lines, and words_result is a list in which each item is a dictionary whose words key holds one line of recognized text.

{'words_result': [{'words': 'o。o'}, {'words': '6226-16:59'}, {'words': '絕望jpg'}], 'log_id': 1393611954748129280, 'words_result_num': 3}
o。o
6226-16:59
絕望jpg
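To make the structure concrete, here is a small sketch that pulls the recognized lines out of such a dictionary. It works on the sample response shown above rather than calling the service, and the helper name extract_words is hypothetical.

```python
def extract_words(result):
    """Return the recognized text lines from a basicGeneral-style response dict."""
    # A response without 'words_result' yields an empty list.
    return [item['words'] for item in result.get('words_result', [])]


sample = {
    'words_result': [{'words': 'o。o'}, {'words': '6226-16:59'}, {'words': '絕望jpg'}],
    'log_id': 1393611954748129280,
    'words_result_num': 3,
}
lines = extract_words(sample)
print(len(lines) == sample['words_result_num'])
print('\n'.join(lines))
```

Using result.get with an empty default means a response that lacks the words_result key simply produces no lines instead of raising KeyError.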

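An image may contain a lot of text, such as a watermark date, and some special symbols may be misrecognized. What we want to extract is the Chinese characters and letters. Since an image may also contain several lines of Chinese text, this example picks the line with the most Chinese characters and names the file after the Chinese characters and letters in that line. The complete example code is as follows: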

# Recognize the text in each image and rename the images in bulk
import datetime
import os
import re

from aip import AipOcr

# Create an AipOcr client
config = {
    'appId': 'your appId',
    'apiKey': 'your apiKey',
    'secretKey': 'your secretKey',
}
client = AipOcr(**config)

pic_dir = r'tiezi_pic/'

# Chinese characters and ASCII letters are kept in the file name
name_pat = re.compile(r'[\u4e00-\u9fa5A-Za-z]+')
# A single Chinese character, used for counting
hanzi_pat = re.compile(r'[\u4e00-\u9fa5]')


# Read an image as bytes
def get_file_content(file_path):
    with open(file_path, 'rb') as fp:
        return fp.read()


# Recognize the text in an image and rename the file after it
def img_to_str(image_path):
    image = get_file_content(image_path)
    # Call general text recognition with the local image
    result = client.basicGeneral(image)
    words_list = [w['words'] for w in result.get('words_result', [])]
    file_name = get_longest_str(words_list)
    if not file_name:  # nothing usable was recognized, keep the old name
        return
    print(file_name)
    file_dir_name = pic_dir + file_name + '.jpg'
    if os.path.exists(file_dir_name):  # handle duplicate file names
        sec = datetime.datetime.now().microsecond  # microsecond part of the current time
        file_dir_name = pic_dir + file_name + str(sec) + '.jpg'
    try:
        os.rename(image_path, file_dir_name)
    except Exception:
        print('Rename failed:', image_path, ' => ', file_name)


# Pick the string with the most Chinese characters and keep its hanzi and letters
def get_longest_str(str_list):
    if not str_list:
        return ''
    longest = max(str_list, key=hanzi_len)
    return ''.join(name_pat.findall(longest))


# Count the Chinese characters in a string
def hanzi_len(item):
    return len(hanzi_pat.findall(item))


# List all images in a folder
def query_picture(dir_path):
    pic_path_list = []
    for filename in os.listdir(dir_path):
        pic_path_list.append(dir_path + filename)
    return pic_path_list


if __name__ == '__main__':
    pic_list = query_picture(pic_dir)
    if len(pic_list) > 0:
        for i in pic_list:
            img_to_str(i)

After the program runs, each image in tiezi_pic/ whose text was recognized is renamed after that text.
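The naming rule can be tried on its own, without calling the OCR service. The sketch below applies it to the three lines from the sample response and also shows one alternative way to avoid name clashes, by appending a counter instead of the microsecond value. Both helper names, pick_name and unique_path, and the _1, _2 suffix scheme are assumptions for illustration and are not part of the article's script.

```python
import os
import re

HANZI = re.compile(r'[\u4e00-\u9fa5]')
HANZI_OR_LETTER = re.compile(r'[\u4e00-\u9fa5A-Za-z]+')


def pick_name(lines):
    """Pick the line with the most Chinese characters; keep only hanzi and letters."""
    if not lines:
        return ''
    best = max(lines, key=lambda s: len(HANZI.findall(s)))
    return ''.join(HANZI_OR_LETTER.findall(best))


def unique_path(directory, name, ext='.jpg'):
    """Return a path in directory that does not exist yet, adding _1, _2, ... if needed."""
    candidate = os.path.join(directory, name + ext)
    counter = 1
    while os.path.exists(candidate):
        candidate = os.path.join(directory, '{}_{}{}'.format(name, counter, ext))
        counter += 1
    return candidate


print(pick_name(['o。o', '6226-16:59', '絕望jpg']))
```

A counter keeps the suffix short and predictable, while the microsecond value used in the script needs no loop; either approach is a reasonable choice for a small folder of images.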
