
Extracting text from HTML in Python

Views: 7  Date: 2022-06-19 08:44:15

When working on NLP problems, you sometimes need a large collection of text. The internet is the largest source of text, but unfortunately extracting text from arbitrary HTML pages is a hard and painful task. Suppose we need to extract the full text from a variety of web pages and strip all HTML tags. The usual default solution is the get_text method from the BeautifulSoup package, here with lxml as the underlying parser. It is a well-tested solution, but it can be very slow when processing thousands upon thousands of HTML documents. By replacing BeautifulSoup with selectolax, you can get a 5-30x speedup almost for free! Here is a simple benchmark that parses 10,000 HTML pages from commoncrawl (https://commoncrawl.org/):

    # coding: utf-8
    from time import time

    import warc
    from bs4 import BeautifulSoup
    from selectolax.parser import HTMLParser


    def get_text_bs(html):
        # BeautifulSoup with lxml as the parser backend
        tree = BeautifulSoup(html, 'lxml')

        body = tree.body
        if body is None:
            return None

        # Drop script and style elements together with their contents
        for tag in body.select('script'):
            tag.decompose()
        for tag in body.select('style'):
            tag.decompose()

        text = body.get_text(separator='\n')
        return text


    def get_text_selectolax(html):
        tree = HTMLParser(html)

        if tree.body is None:
            return None

        for tag in tree.css('script'):
            tag.decompose()
        for tag in tree.css('style'):
            tag.decompose()

        text = tree.body.text(separator='\n')
        return text


    def read_doc(record, parser=get_text_selectolax):
        url = record.url
        text = None

        if url:
            payload = record.payload.read()
            # The HTTP header and the HTML body are separated by a blank line
            header, html = payload.split(b'\r\n\r\n', maxsplit=1)
            html = html.strip()

            if len(html) > 0:
                text = parser(html)

        return url, text


    def process_warc(file_name, parser, limit=10000):
        warc_file = warc.open(file_name, 'rb')
        t0 = time()
        n_documents = 0
        for i, record in enumerate(warc_file):
            url, doc = read_doc(record, parser)

            if not doc or not url:
                continue

            n_documents += 1

            if i > limit:
                break

        warc_file.close()
        print('Parser: %s' % parser.__name__)
        print('Parsing took %s seconds and produced %s documents\n' % (time() - t0, n_documents))

    >>> ! wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-05/segments/1516084886237.6/warc/CC-MAIN-20180116070444-20180116090444-00000.warc.gz
    >>> file_name = 'CC-MAIN-20180116070444-20180116090444-00000.warc.gz'
    >>> process_warc(file_name, get_text_selectolax, 10000)
    Parser: get_text_selectolax
    Parsing took 16.170367002487183 seconds and produced 3317 documents
    >>> process_warc(file_name, get_text_bs, 10000)
    Parser: get_text_bs
    Parsing took 432.6902508735657 seconds and produced 3283 documents
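The read_doc step in the benchmark assumes each stored HTTP response is a header block, a blank line, and then the HTML body. Below is a minimal sketch of that split on a made-up payload. The helper name split_http_payload and the sample bytes are illustrative assumptions, not part of the original benchmark, and unlike the benchmark code this version returns an empty body instead of raising when no blank line is found.

```python
def split_http_payload(payload: bytes):
    """Split a raw HTTP response into (header, body) at the first blank line.

    Returns (payload, b'') when no blank line is present, instead of raising.
    """
    parts = payload.split(b'\r\n\r\n', maxsplit=1)
    if len(parts) == 2:
        return parts[0], parts[1].strip()
    return payload, b''


# Hypothetical payload, shaped like a response stored in a WARC record
sample = (
    b'HTTP/1.1 200 OK\r\n'
    b'Content-Type: text/html\r\n'
    b'\r\n'
    b'<html><body>Hi</body></html>\r\n'
)
header, body = split_http_payload(sample)
print(header)
print(body)
```

Only the first blank line is treated as the separator, so blank lines inside the HTML body are left untouched.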

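Clearly, this is not the best way to benchmark something, but it gives an idea: selectolax can sometimes be up to 30 times faster than lxml (about 27 times in the run above). selectolax is best suited to stripping HTML down to plain text.

I have 10,000+ HTML snippets that I need to index into Elasticsearch as plain text. (Elasticsearch has an html_strip text filter, but it is not what I want or need to use in this context.) It turns out that stripping HTML to plain text at that scale is actually quite inefficient. So what is the most efficient way to do it?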

PyQuery

    from pyquery import PyQuery as pq

    text = pq(html).text()

selectolax

    from selectolax.parser import HTMLParser

    text = HTMLParser(html).text()

Regular expression

    import re

    regex = re.compile(r'<.*?>')
    text = regex.sub('', html)

Results

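I wrote a script to measure the time. It loops over 10,000 files containing HTML snippets. Note! These snippets are not complete <html> documents (with <head> and <body> and so on), just small pieces of HTML. The average size is 10,314 bytes (the median is 5,138 bytes). The results are as follows: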

    pyquery
      SUM:    18.61 seconds
      MEAN:   1.8633 ms
      MEDIAN: 1.0554 ms
    selectolax
      SUM:    3.08 seconds
      MEAN:   0.3149 ms
      MEDIAN: 0.1621 ms
    regex
      SUM:    1.64 seconds
      MEAN:   0.1613 ms
      MEDIAN: 0.0881 ms

I have run this many times and the results are very stable. The key point: selectolax is roughly 6 to 7 times faster than PyQuery.
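The timing script itself is not shown. A minimal sketch of how per-document SUM, MEAN and MEDIAN figures like these could be collected follows. The function name benchmark, the result keys and the placeholder workload are assumptions made for illustration. Any callable that takes an HTML string, such as one of the three extractors above, can be passed in.

```python
import statistics
from time import perf_counter


def benchmark(func, documents):
    """Time func on each document; return total seconds plus mean/median in ms."""
    timings_ms = []
    for doc in documents:
        t0 = perf_counter()
        func(doc)
        timings_ms.append((perf_counter() - t0) * 1000)
    return {
        'sum_s': sum(timings_ms) / 1000,
        'mean_ms': statistics.mean(timings_ms),
        'median_ms': statistics.median(timings_ms),
        'count': len(timings_ms),
    }


# Placeholder workload: len() stands in for a real text extractor.
snippets = ['<p>Hello <b>world</b></p>'] * 100
result = benchmark(len, snippets)
print('SUM: %.2f seconds MEAN: %.4f ms MEDIAN: %.4f ms'
      % (result['sum_s'], result['mean_ms'], result['median_ms']))
```

Reporting the median next to the mean helps here because snippet sizes are skewed (the mean size is about twice the median), so a few large documents can dominate the mean.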

Regular expressions work well? Really?

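For the most basic HTML blobs it probably works fine. However, if the HTML is <p>Foo &amp; Bar</p>, I expect the plain-text conversion to be Foo & Bar, not Foo &amp; Bar. More importantly, PyQuery and selectolax support something very specific that matters for my use case: before continuing, I need to remove certain tags (and their contents). For example: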

    <h4 class='warning'>This should get stripped.</h4>
    <p>Please keep.</p>
    <div style='display: none'>This should also get stripped.</div>

A regular expression will never be able to do that.
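Both limitations are easy to reproduce with the standard library alone. The sketch below applies the same tag-stripping pattern to the two examples above. The helper name strip_tags_regex is an assumption for illustration, and html.unescape is shown only as one possible way to decode entities after stripping, not as something the original script did.

```python
import re
from html import unescape

TAG_RE = re.compile(r'<.*?>')


def strip_tags_regex(html_text: str) -> str:
    """Delete anything that looks like a tag; entities and element contents stay."""
    return TAG_RE.sub('', html_text)


# Entities are left encoded unless they are decoded in a separate step.
raw = strip_tags_regex('<p>Foo &amp; Bar</p>')
decoded = unescape(raw)
print(raw)
print(decoded)

# The tags disappear, but the text inside the unwanted elements survives.
snippet = (
    '<h4 class="warning">This should get stripped.</h4>'
    '<p>Please keep.</p>'
    '<div style="display: none">This should also get stripped.</div>'
)
stripped = strip_tags_regex(snippet)
print(stripped)
```

The pattern only deletes the markup between angle brackets, so it has no notion of which element a piece of text belongs to.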

Version 2.0

My requirements may change, but basically I want to remove certain tags, for example <div class="warning">, <div class="hidden"> and <div style="display: none">. So let's implement that:

PyQuery

    import re

    from pyquery import PyQuery as pq

    _display_none_regex = re.compile(r'display:\s*none')

    doc = pq(html)
    # Remove elements matched by class
    doc.remove('div.warning, div.hidden')
    # Remove divs hidden through an inline style
    for div in doc('div[style]').items():
        style_value = div.attr('style')
        if _display_none_regex.search(style_value):
            div.remove()
    text = doc.text()

selectolax

    import re

    from selectolax.parser import HTMLParser

    _display_none_regex = re.compile(r'display:\s*none')

    tree = HTMLParser(html)
    # Remove elements matched by class
    for tag in tree.css('div.warning, div.hidden'):
        tag.decompose()
    # Remove divs hidden through an inline style
    for tag in tree.css('div[style]'):
        style_value = tag.attributes['style']
        if style_value and _display_none_regex.search(style_value):
            tag.decompose()
    text = tree.body.text()
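Both variants depend on the same inline-style check, which can be exercised on its own. The sketch below wraps the pattern in a hypothetical helper, is_hidden, which is not part of the original code, to show which style values it catches and which it misses.

```python
import re

_display_none_regex = re.compile(r'display:\s*none')


def is_hidden(style_value):
    """Return True when an inline style string contains 'display: none'."""
    return bool(style_value and _display_none_regex.search(style_value))


for style in ['display: none', 'display:none', 'color: red; display:   none;',
              'display: block', 'DISPLAY: NONE', 'display : none', None]:
    print(repr(style), is_hidden(style))
```

As written, the pattern is case-sensitive and does not allow whitespace before the colon, so the uppercase and 'display : none' variants are not detected. Whether that matters depends on how the snippets were authored.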

Both implementations work. When I now run the same benchmark over the 10,000 snippets, the new results are:

    pyquery
      SUM:    21.70 seconds
      MEAN:   2.1701 ms
      MEDIAN: 1.3989 ms
    selectolax
      SUM:    3.59 seconds
      MEAN:   0.3589 ms
      MEDIAN: 0.2184 ms
    regex
      Skip

Again, selectolax beats PyQuery by a factor of about 6.
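The speedup factors quoted in this article can be checked directly from the reported totals. This is a small sketch: the numbers are copied from the benchmark output above, while the dictionary labels and the helper name speedup are illustrative.

```python
def speedup(slow_seconds, fast_seconds):
    """How many times faster the fast run is, based on total wall time."""
    return slow_seconds / fast_seconds


# (slow total, fast total) in seconds, as reported in the article
reported = {
    'commoncrawl: BeautifulSoup vs selectolax': (432.6902508735657, 16.170367002487183),
    'snippets v1: PyQuery vs selectolax': (18.61, 3.08),
    'snippets v2: PyQuery vs selectolax': (21.70, 3.59),
}

for label, (slow, fast) in reported.items():
    print('%s: %.1fx' % (label, speedup(slow, fast)))
```

By total time this works out to roughly 27x for the commoncrawl run and about 6x for both snippet runs. Note that the two commoncrawl runs produced slightly different document counts (3317 and 3283), so that ratio is only approximate.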

Conclusion

Regular expressions are fast but weak in functionality. The efficiency of selectolax is impressive.
