Python web page parsers: using the third-party lxml library and XPath

Views: 36 | Date: 2022-06-23 11:51:54

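Today's topic is another third-party library, lxml, for parsing web pages. Likewise, lxml can parse both HTML and XML documents; it can also handle large documents, and its parsing speed is relatively fast.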
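As a rough illustration of the point that lxml handles both HTML and XML, the sketch below parses one string of each kind. The sample strings and the variable names (broken_html, html_root, xml_root) are made up for this example and are not part of the original article.

```python
from lxml import etree

# HTML input: etree.HTML() uses a forgiving parser, so unclosed tags are tolerated
# and the result is wrapped in <html><body> when those tags are missing.
broken_html = "<p>unclosed paragraph<a id='x'>link"
html_root = etree.HTML(broken_html)
print(html_root.tag)
print(etree.tostring(html_root).decode())

# XML input: etree.fromstring() uses the strict XML parser,
# so the input must be well-formed.
xml_root = etree.fromstring("<books><book id='1'>lxml</book></books>")
print(xml_root.xpath("/books/book/@id"))
```

Both parsers return the same kind of element object, so the xpath() calls shown in the rest of this article should apply to either one.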

To use lxml well you need to know XPath, because selecting elements with lxml is built around XPath expressions. This chapter therefore focuses mainly on how to use XPath syntax.

1. Import the lxml library and create an object

    # -*- coding: UTF-8 -*-
    # Import etree from lxml
    from lxml import etree

    # Start from the page source that the page downloader has already fetched.
    # Here we simply reuse the official example document.
    html_doc = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p>The Dormouse's story</p>
    <p>Once upon a time there were three little sisters; and their names were
    <a rel='external nofollow' class='sister' id='link1'>Elsie</a>,
    <a rel='external nofollow' class='sister' id='link2'>Lacie</a> and
    <a rel='external nofollow' class='sister' id='link3'>Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p>...</p>
    '''

    # Parse the downloader's html_doc string; etree.HTML() returns an lxml element object
    html = etree.HTML(html_doc)

2. Extract page elements with XPath syntax

Selecting elements by node path

    # xpath() selects elements by following tag nodes along a path
    print(html.xpath('/html/body/p'))
    # [<Element p at 0x2ebc908>, <Element p at 0x2ebc8c8>, <Element p at 0x2eb9a48>]

    print(html.xpath('/html'))
    # [<Element html at 0x34bc948>]

    # Find <a> nodes among all descendant nodes
    print(html.xpath('//a'))

    # Find the <html> node among the direct children of the document root
    print(html.xpath('/html'))
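The difference between an absolute path, a descendant search, and a path relative to one element can be sketched as follows. This is a minimal example under assumptions: the document is a simplified version of the sample (the p and class attributes are written out explicitly), and the variable names are chosen for illustration only.

```python
from lxml import etree

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title">The Dormouse's story</p>
<p class="story">Once upon a time there were three little sisters:
<a class="sister" id="link1">Elsie</a>,
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>.</p>
<p class="story">...</p>
</body></html>
"""

root = etree.HTML(html_doc)

# Absolute path: each step names a direct child of the previous step.
paragraphs = root.xpath("/html/body/p")

# "//" searches every descendant, whatever its depth.
links = root.xpath("//a")

# A path that starts with "." is evaluated relative to the element
# that xpath() is called on, here the second paragraph.
story = paragraphs[1]
links_in_story = story.xpath(".//a")
direct_children = story.xpath("./a")

print(len(paragraphs), len(links), len(links_in_story), len(direct_children))
```

Calling xpath() on an element with a path that begins with "//" still searches the whole document, so the leading dot matters when a search should be limited to one subtree.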

Selecting elements with filters

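    # --- Select elements by a single attribute ---
    # Descendant <a> tags whose class attribute equals "bro"
    # (the sample document has none, so this returns an empty list)
    print(html.xpath("//a[@class='bro']"))
    # Descendant <a> tags whose id attribute equals "link3"
    print(html.xpath("//a[@id='link3']"))

    # --- Select elements by several attributes ---
    # <a> tags whose class contains "sister" and whose id contains "link1"
    print(html.xpath("//a[contains(@class,'sister') and contains(@id,'link1')]"))
    # <a> tags whose class contains "bro" or whose id contains "link1"
    print(html.xpath("//a[contains(@class,'bro') or contains(@id,'link1')]"))

    # last() selects the last <a> among the <a> children of each parent
    print(html.xpath("//a[last()]"))
    # The index 1 selects the first <a> among the <a> children of each parent
    print(html.xpath("//a[1]"))
    # position() filters by index: the first two <a> tags under each parent
    print(html.xpath("//a[position() < 3]"))

    # --- Select several elements with a computed condition ---
    # position() again: the first and the third <a> tag
    # Operators that can be used in the expression: >, <, =, >=, <=, +, -, and, or
    print(html.xpath("//a[position() = 1 or position() = 3]"))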
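Two details of these filters are easy to miss: @class='...' compares the whole attribute value while contains() matches a substring, and last() or an index counts positions among the children of each parent rather than across the whole document. The sketch below shows both, using a small made-up document (the "bro" link and the extra "big" class are invented for illustration) and a hypothetical helper named ids.

```python
from lxml import etree

doc = etree.HTML("""
<body>
<p><a class="sister big" id="link1">Elsie</a><a class="sister" id="link2">Lacie</a></p>
<p><a class="bro" id="link3">Tom</a></p>
</body>
""")


def ids(expr):
    """Return the id attributes of the elements selected by expr."""
    return doc.xpath(expr + "/@id")


# Exact comparison: "sister big" is not equal to "sister".
exact = ids("//a[@class='sister']")

# Substring match: both "sister big" and "sister" contain "sister".
substring = ids("//a[contains(@class,'sister')]")

# last() is evaluated per parent: the last <a> inside each <p>.
last_per_parent = ids("//a[last()]")

# Wrapping the path in parentheses applies last() to the whole result set.
last_overall = ids("(//a)[last()]")

print(exact, substring, last_per_parent, last_overall)
```

In the article's sample document all three links share one parent, so //a[last()] and (//a)[last()] happen to select the same element there.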

Getting an element's attributes and text

    # Use @ to read an attribute value and text() to read a tag's text

    # Get the attribute value
    print(html.xpath('//a[position() = 1]/@class'))
    # ['sister']

    # Get the tag's text
    print(html.xpath('//a[position() = 1]/text()'))
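Note that xpath() returns a list even when only one value matches. A minimal sketch of collecting attributes and text for every link, assuming a simplified version of the sample document with the class attributes written out (the variable names are illustrative):

```python
from lxml import etree

root = etree.HTML("""
<p class="story">Once upon a time there were three little sisters:
<a class="sister" id="link1">Elsie</a>,
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>.</p>
""")

# "/@class" and "/text()" return lists of strings, one entry per match.
classes = root.xpath("//a/@class")
names = root.xpath("//a/text()")

# The same data can be read from the element objects themselves:
# get() reads an attribute and .text holds the leading text of the tag.
pairs = [(a.get("id"), a.text) for a in root.xpath("//a")]

# A single match is still wrapped in a list, so index it explicitly
# and guard against an empty result.
first = root.xpath("//a[1]/text()")
first_name = first[0] if first else None

print(classes, names, pairs, first_name)
```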
