Python XML Tutorial with ElementTree: Beginner's Guide

Learn how to parse, explore, modify, and populate XML files with Python's ElementTree package, for loops, and XPath expressions.

Updated Jun 5, 2026 · 10 min read


As a data scientist, you'll find that understanding XML is useful both for web scraping and for the general practice of parsing a structured document.

You'll learn more about XML and get introduced to Python's ElementTree package.
Next, you'll discover how to explore XML trees to better understand the data you're working with, using ElementTree functions, for loops, and XPath expressions.
After that, you'll learn how to modify an XML file.
You'll also use XPath expressions to populate XML files.

What is XML?

XML stands for "Extensible Markup Language". It is a text format for storing and exchanging structured data, and it is common on the web, where the program that consumes the data interprets its structure.

XML creates a tree-like structure that is easy to interpret and supports a hierarchy. Whenever a page follows XML, it can be called an XML document.

XML documents have sections, called elements, defined by a start-tag and an end-tag. A tag is a markup construct that begins with < and ends with >. The characters between the start-tag and end-tag, if there are any, are the element's content. Elements can contain markup, including other elements, which are called "child elements".
The largest, top-level element is called the root, which contains all other elements.
Attributes are name-value pairs that exist within a start-tag or an empty-element tag. An XML attribute can only have a single value, and each attribute can appear at most once on each element.

To understand this a little better, take a look at the following XML file (abbreviated: the Comedy genre and the closing </collection> tag are omitted here):

<?xml version="1.0"?>
<collection>
    <genre category="Action">
        <decade years="1980s">
            <movie favorite="True" title="Indiana Jones: The raiders of the lost Ark">
                <format multiple="No">DVD</format>
                <year>1981</year>
                <rating>PG</rating>
                <description>
                'Archaeologist and adventurer Indiana Jones 
                is hired by the U.S. government to find the Ark of the 
                Covenant before the Nazis.'
                </description>
            </movie>
               <movie favorite="True" title="THE KARATE KID">
               <format multiple="Yes">DVD,Online</format>
               <year>1984</year>
               <rating>PG</rating>
               <description>None provided.</description>
            </movie>
            <movie favorite="False" title="Back 2 the Future">
               <format multiple="False">Blu-ray</format>
               <year>1985</year>
               <rating>PG</rating>
               <description>Marty McFly</description>
            </movie>
        </decade>
        <decade years="1990s">
            <movie favorite="False" title="X-Men">
               <format multiple="Yes">dvd, digital</format>
               <year>2000</year>
               <rating>PG-13</rating>
               <description>Two mutants come to a private academy for their kind whose resident superhero team must 
               oppose a terrorist organization with similar powers.</description>
            </movie>
            <movie favorite="True" title="Batman Returns">
               <format multiple="No">VHS</format>
               <year>1992</year>
               <rating>PG13</rating>
               <description>NA.</description>
            </movie>
               <movie favorite="False" title="Reservoir Dogs">
               <format multiple="No">Online</format>
               <year>1992</year>
               <rating>R</rating>
               <description>WhAtEvER I Want!!!?!</description>
            </movie>
        </decade>    
    </genre>

    <genre category="Thriller">
        <decade years="1970s">
            <movie favorite="False" title="ALIEN">
                <format multiple="Yes">DVD</format>
                <year>1979</year>
                <rating>R</rating>
                <description>"""""""""</description>
            </movie>
        </decade>
        <decade years="1980s">
            <movie favorite="True" title="Ferris Bueller's Day Off">
                <format multiple="No">DVD</format>
                <year>1986</year>
                <rating>PG13</rating>
                <description>Funny movie about a funny guy</description>
            </movie>
            <movie favorite="FALSE" title="American Psycho">
                <format multiple="No">blue-ray</format>
                <year>2000</year>
                <rating>Unrated</rating>
                <description>psychopathic Bateman</description>
            </movie>
        </decade>
    </genre>

From what you have read above, you can see that <collection> is the single root element: it contains all the other elements, such as <genre> or <movie>, which are its child elements or subelements. As you can see, these elements are nested.

Note that these child elements can also act as parents and contain their own child elements, which are then called "grandchild elements".

You'll see that, for example, the <movie> element contains a couple of "attributes", such as favorite and title, that give even more information!

With this short introduction to XML files in mind, you're ready to learn more about ElementTree!

Introduction to ElementTree

The XML tree structure makes navigation, modification, and removal relatively simple to do programmatically. Python has a built-in library, ElementTree, with functions to read and manipulate XML (and other similarly structured files).

First, import ElementTree. It's common practice to use the alias ET:

import xml.etree.ElementTree as ET

Parsing XML Data

The XML file provided describes a basic collection of movies. The only problem is that the data is a mess! There have been many different curators of this collection, and each has their own way of entering data. The main goal in this tutorial is to read and understand the file with Python, then fix the problems.

First, you need to read in the file with ElementTree.

tree = ET.parse('movies.xml')
root = tree.getroot()
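If you want to follow along without a movies.xml file on disk, ElementTree can also parse XML held in a string. The following is a minimal sketch; the xml_text variable and its trimmed-down content are assumptions made for illustration, not the tutorial's full file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for movies.xml
xml_text = """<collection>
    <genre category="Action">
        <decade years="1980s">
            <movie favorite="True" title="THE KARATE KID">
                <year>1984</year>
            </movie>
        </decade>
    </genre>
</collection>"""

# ET.fromstring() parses a string and returns the root Element directly
root = ET.fromstring(xml_text)
print(root.tag)

# Wrap the root in an ElementTree when you need tree-level methods such as write()
tree = ET.ElementTree(root)
print(tree.getroot() is root)
```

ET.parse() returns an ElementTree object, while ET.fromstring() returns the root element itself, which is why the wrapping step is only needed for tree-level operations.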

Now that you have initialized the tree, you should look at the XML and print out values to understand how the tree is structured.

Every part of a tree (root included) has a tag that describes the element. In addition, as you saw in the introduction, elements can have attributes, which are additional descriptors that are especially useful when the same tag is used repeatedly. Attributes also help to validate the values entered for that tag, which again contributes to the structured format of XML.

You'll see later in this tutorial that attributes can be quite powerful when included in an XML!

root.tag

'collection'

At the top level, you see that this XML is rooted in the collection tag.

root.attrib

{}

So, the root has no attributes.

For Loops

You can easily iterate over the subelements (commonly called "children") of the root by using a simple "for" loop.

for child in root:
    print(child.tag, child.attrib)

genre {'category': 'Action'}
genre {'category': 'Thriller'}
genre {'category': 'Comedy'}

Now you know that the children of the root collection are all genre elements. To designate the genre, the XML uses the category attribute. According to the genre elements, there are Action, Thriller, and Comedy movies.
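Because an element behaves like a sequence of its direct children, the same loop can be nested to walk one level deeper, and children can be reached by index. This is a small sketch on a made-up sample document (the sample variable and its single-letter titles are assumptions, not the tutorial's data).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with the same genre/decade/movie nesting
sample = ET.fromstring("""<collection>
  <genre category="Action">
    <decade years="1980s"><movie title="A"/><movie title="B"/></decade>
    <decade years="1990s"><movie title="C"/></decade>
  </genre>
  <genre category="Comedy">
    <decade years="1960s"><movie title="D"/></decade>
  </genre>
</collection>""")

# Iterating an element yields its direct children; len() counts them
for genre in sample:
    for decade in genre:
        print(genre.get("category"), decade.get("years"), len(decade))

# Indexing works too: first genre -> first decade -> first movie
first_movie = sample[0][0][0]
print(first_movie.attrib)
```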

Typically, it is helpful to know all the elements in the entire tree. One useful method for doing that is root.iter(). You can put it into a "for" loop and it will iterate over the entire tree.

[elem.tag for elem in root.iter()]

['collection',
 'genre',
 'decade',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'decade',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'genre',
 'decade',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'decade',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'genre',
 'decade',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'decade',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'decade',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'decade',
 'movie',
 'format',
 'year',
 'rating',
 'description']

This gives a general idea of how many elements you have, but it does not show the attributes or the levels in the tree.
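One way to see both the levels and the attributes is a short recursive walk that indents each tag by its depth. The outline() helper below is a hypothetical function written for this sketch, not part of ElementTree.

```python
import xml.etree.ElementTree as ET

def outline(elem, depth=0):
    """Return one line per element, indented by depth, with its attributes."""
    lines = ["  " * depth + f"{elem.tag} {elem.attrib}"]
    for child in elem:
        lines.extend(outline(child, depth + 1))
    return lines

# Hypothetical one-movie sample document
sample = ET.fromstring(
    '<collection><genre category="Action">'
    '<decade years="1980s"><movie title="A"/></decade>'
    '</genre></collection>'
)
tree_lines = outline(sample)
print("\n".join(tree_lines))
```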

There is a helpful way to see the whole document. ElementTree (remember, aliased as ET) provides a module-level function, ET.tostring(), that serializes an element together with everything below it. If you pass the root to ET.tostring(), you get the whole document back.

The call takes a slightly unusual form. When you pass an encoding such as 'utf8', ET.tostring() returns bytes rather than a string, so you decode the result with the same encoding before printing it. UTF-8 is the typical encoding for XML documents.

print(ET.tostring(root, encoding='utf8').decode('utf8'))

<?xml version='1.0' encoding='utf8'?>
<collection>
    <genre category="Action">
        <decade years="1980s">
            <movie favorite="True" title="Indiana Jones: The raiders of the lost Ark">
                <format multiple="No">DVD</format>
                <year>1981</year>
                <rating>PG</rating>
                <description>
                'Archaeologist and adventurer Indiana Jones 
                is hired by the U.S. government to find the Ark of the 
                Covenant before the Nazis.'
                </description>
            </movie>
               <movie favorite="True" title="THE KARATE KID">
               <format multiple="Yes">DVD,Online</format>
               <year>1984</year>
               <rating>PG</rating>
               <description>None provided.</description>
            </movie>
            <movie favorite="False" title="Back 2 the Future">
               <format multiple="False">Blu-ray</format>
               <year>1985</year>
               <rating>PG</rating>
               <description>Marty McFly</description>
            </movie>
        </decade>
        <decade years="1990s">
            <movie favorite="False" title="X-Men">
               <format multiple="Yes">dvd, digital</format>
               <year>2000</year>
               <rating>PG-13</rating>
               <description>Two mutants come to a private academy for their kind whose resident superhero team must 
               oppose a terrorist organization with similar powers.</description>
            </movie>
            <movie favorite="True" title="Batman Returns">
               <format multiple="No">VHS</format>
               <year>1992</year>
               <rating>PG13</rating>
               <description>NA.</description>
            </movie>
               <movie favorite="False" title="Reservoir Dogs">
               <format multiple="No">Online</format>
               <year>1992</year>
               <rating>R</rating>
               <description>WhAtEvER I Want!!!?!</description>
            </movie>
        </decade>    
    </genre>

    <genre category="Thriller">
        <decade years="1970s">
            <movie favorite="False" title="ALIEN">
                <format multiple="Yes">DVD</format>
                <year>1979</year>
                <rating>R</rating>
                <description>"""""""""</description>
            </movie>
        </decade>
        <decade years="1980s">
            <movie favorite="True" title="Ferris Bueller's Day Off">
                <format multiple="No">DVD</format>
                <year>1986</year>
                <rating>PG13</rating>
                <description>Funny movie about a funny guy</description>
            </movie>
            <movie favorite="FALSE" title="American Psycho">
                <format multiple="No">blue-ray</format>
                <year>2000</year>
                <rating>Unrated</rating>
                <description>psychopathic Bateman</description>
            </movie>
        </decade>
    </genre>

    <genre category="Comedy">
        <decade years="1960s">
            <movie favorite="False" title="Batman: The Movie">
                <format multiple="Yes">DVD,VHS</format>
                <year>1966</year>
                <rating>PG</rating>
                <description>What a joke!</description>
            </movie>
        </decade>
        <decade years="2010s">
            <movie favorite="True" title="Easy A">
                <format multiple="No">DVD</format>
                <year>2010</year>
                <rating>PG--13</rating>
                <description>Emma Stone = Hester Prynne</description>
            </movie>
            <movie favorite="True" title="Dinner for SCHMUCKS">
                <format multiple="Yes">DVD,digital,Netflix</format>
                <year>2011</year>
                <rating>Unrated</rating>
                <description>Tim (Rudd) is a rising executive
                 who “succeeds” in finding the perfect guest, 
                 IRS employee Barry (Carell), for his boss’ monthly event, 
                 a so-called “dinner for idiots,” which offers certain 
                 advantages to the exec who shows up with the biggest buffoon.
                 </description>
            </movie>
        </decade>
        <decade years="1980s">
            <movie favorite="False" title="Ghostbusters">
                <format multiple="No">Online,VHS</format>
                <year>1984</year>
                <rating>PG</rating>
                <description>Who ya gonna call?</description>
            </movie>
        </decade>
        <decade years="1990s">
            <movie favorite="True" title="Robin Hood: Prince of Thieves">
                <format multiple="No">Blu_Ray</format>
                <year>1991</year>
                <rating>Unknown</rating>
                <description>Robin Hood slaying</description>
            </movie>
        </decade>
    </genre>
</collection>
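The encode-then-decode step can be seen in isolation on a single element. The sketch below uses a made-up movie element and also shows encoding="unicode", which asks ET.tostring() for a str directly; treat it as an illustration rather than a replacement for the call above.

```python
import xml.etree.ElementTree as ET

# Hypothetical single-element document
elem = ET.fromstring('<movie title="Easy A"><year>2010</year></movie>')

as_bytes = ET.tostring(elem, encoding="utf8")    # bytes, must be decoded to print cleanly
as_text = ET.tostring(elem, encoding="unicode")  # already a str

print(type(as_bytes).__name__, type(as_text).__name__)
print(as_text)
```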

You can expand the use of iter() to help find particular elements of interest. root.iter(tag) yields every element in the tree, starting at the root, whose tag matches the one you specify. Here, you will list all the attributes of the movie elements in the tree:

for movie in root.iter('movie'):
    print(movie.attrib)

{'favorite': 'True', 'title': 'Indiana Jones: The raiders of the lost Ark'}
{'favorite': 'True', 'title': 'THE KARATE KID'}
{'favorite': 'False', 'title': 'Back 2 the Future'}
{'favorite': 'False', 'title': 'X-Men'}
{'favorite': 'True', 'title': 'Batman Returns'}
{'favorite': 'False', 'title': 'Reservoir Dogs'}
{'favorite': 'False', 'title': 'ALIEN'}
{'favorite': 'True', 'title': "Ferris Bueller's Day Off"}
{'favorite': 'FALSE', 'title': 'American Psycho'}
{'favorite': 'False', 'title': 'Batman: The Movie'}
{'favorite': 'True', 'title': 'Easy A'}
{'favorite': 'True', 'title': 'Dinner for SCHMUCKS'}
{'favorite': 'False', 'title': 'Ghostbusters'}
{'favorite': 'True', 'title': 'Robin Hood: Prince of Thieves'}

You can already see how the movies have been entered in different ways. Don't worry about that for now. You'll get a chance to fix one of the errors later in this tutorial.
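When you only need one attribute rather than the whole dictionary, the element's .get() method reads it by name and accepts a default for elements where it is missing. A short sketch on a made-up sample (the "Untitled entry" movie is invented to show the default):

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the last movie deliberately lacks a favorite attribute
sample = ET.fromstring("""<collection>
  <genre category="Thriller">
    <decade years="1970s">
      <movie favorite="False" title="ALIEN"/>
    </decade>
    <decade years="1980s">
      <movie favorite="True" title="Ferris Bueller's Day Off"/>
      <movie title="Untitled entry"/>
    </decade>
  </genre>
</collection>""")

titles = [movie.get("title") for movie in sample.iter("movie")]
# The second argument is returned when the attribute does not exist
favorites = [movie.get("favorite", "unknown") for movie in sample.iter("movie")]
print(titles)
print(favorites)
```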

XPath Expressions

Many times, elements will not have attributes; they will only have text content. Using the .text attribute, you can print out this content.

Now, print out all the descriptions of the movies.

for description in root.iter('description'):
    print(description.text)

                'Archaeologist and adventurer Indiana Jones 
                is hired by the U.S. government to find the Ark of the 
                Covenant before the Nazis.'

None provided.
Marty McFly
Two mutants come to a private academy for their kind whose resident superhero team must 
               oppose a terrorist organization with similar powers.
NA.
WhAtEvER I Want!!!?!
"""""""""
Funny movie about a funny guy
psychopathic Bateman
What a joke!
Emma Stone = Hester Prynne
Tim (Rudd) is a rising executive
                 who “succeeds” in finding the perfect guest, 
                 IRS employee Barry (Carell), for his boss’ monthly event, 
                 a so-called “dinner for idiots,” which offers certain 
                 advantages to the exec who shows up with the biggest buffoon.

Who ya gonna call?
Robin Hood slaying
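Some descriptions carry the line breaks and indentation of the source file in their .text. If you want them on a single line, one common approach is to split on whitespace and rejoin. The element below is a shortened, hypothetical version of the first description.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened multi-line description
elem = ET.fromstring("""<description>
    'Archaeologist and adventurer Indiana Jones
    is hired by the U.S. government.'
</description>""")

# str.split() with no arguments splits on any run of whitespace,
# so joining with single spaces collapses newlines and indentation
clean = " ".join(elem.text.split())
print(clean)
```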

Printing out the XML is helpful, but XPath is a query language for searching through XML quickly and easily. XPath stands for XML Path Language and, as the name suggests, uses a "path-like" syntax to identify and navigate nodes in an XML document.

Understanding XPath is critically important for scanning and populating XML. ElementTree has a .findall() method: given a plain tag name, it searches only the direct children of the element it is called on; given a path, it supports a limited subset of XPath, which lets you specify more useful searches.

Here, you will search the tree for movies that came out in 1992:

for movie in root.findall("./genre/decade/movie[year='1992']"):
    print(movie.attrib)

{'favorite': 'True', 'title': 'Batman Returns'}
{'favorite': 'False', 'title': 'Reservoir Dogs'}
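Spelling out every level of the path is not always necessary. ElementTree's XPath subset also accepts ".//", which matches elements at any depth below the starting element. The sketch below runs on a small made-up document and also shows that .find() returns None when nothing matches.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-movie sample
sample = ET.fromstring("""<collection>
  <genre category="Action">
    <decade years="1990s">
      <movie title="Batman Returns"><year>1992</year></movie>
      <movie title="X-Men"><year>2000</year></movie>
    </decade>
  </genre>
</collection>""")

# ".//movie" matches movie elements at any depth below sample
hits = sample.findall(".//movie[year='1992']")
print([m.get("title") for m in hits])

# find() returns only the first match, or None when there is none
missing = sample.find(".//movie[year='1800']")
print(missing)
```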

The .findall() method always starts at the element it is called on. This kind of method is extremely powerful for a "find and replace". You can even search on attributes!

Now, print out only the movies that are available in multiple formats (an attribute).

for movie in root.findall("./genre/decade/movie/format[@multiple='Yes']"):
    print(movie.attrib)

{'multiple': 'Yes'}
{'multiple': 'Yes'}
{'multiple': 'Yes'}
{'multiple': 'Yes'}
{'multiple': 'Yes'}

Brainstorm why, in this case, the print statement returns the "Yes" values of multiple. Think about how the "for" loop is defined. Could you rewrite this loop to print out the movie titles instead? Try it below:

Tip: use '..' in the XPath to return the parent element of the current element.

for movie in root.findall("./genre/decade/movie/format[@multiple='Yes']/.."):
    print(movie.attrib)

{'favorite': 'True', 'title': 'THE KARATE KID'}
{'favorite': 'False', 'title': 'X-Men'}
{'favorite': 'False', 'title': 'ALIEN'}
{'favorite': 'False', 'title': 'Batman: The Movie'}
{'favorite': 'True', 'title': 'Dinner for SCHMUCKS'}
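Elements do not keep a reference to their parent, so another way to get from a format element back to its movie is to build a child-to-parent lookup once and reuse it. This is a sketch on a made-up two-movie sample; parent_of is a name chosen here for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with format elements directly under each movie
sample = ET.fromstring("""<collection>
  <movie title="THE KARATE KID"><format multiple="Yes">DVD,Online</format></movie>
  <movie title="Batman Returns"><format multiple="No">VHS</format></movie>
</collection>""")

# Map every child element to its parent by walking the whole tree once
parent_of = {child: parent for parent in sample.iter() for child in parent}

multi = [
    parent_of[fmt].get("title")
    for fmt in sample.iter("format")
    if fmt.get("multiple") == "Yes"
]
print(multi)
```

The lookup is handy when you need the parent for many elements or for elements you found without an XPath expression.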

Modifying an XML

Earlier, the movie titles were an absolute mess. Now, print them out again:

for movie in root.iter('movie'):
    print(movie.attrib)

{'favorite': 'True', 'title': 'Indiana Jones: The raiders of the lost Ark'}
{'favorite': 'True', 'title': 'THE KARATE KID'}
{'favorite': 'False', 'title': 'Back 2 the Future'}
{'favorite': 'False', 'title': 'X-Men'}
{'favorite': 'True', 'title': 'Batman Returns'}
{'favorite': 'False', 'title': 'Reservoir Dogs'}
{'favorite': 'False', 'title': 'ALIEN'}
{'favorite': 'True', 'title': "Ferris Bueller's Day Off"}
{'favorite': 'FALSE', 'title': 'American Psycho'}
{'favorite': 'False', 'title': 'Batman: The Movie'}
{'favorite': 'True', 'title': 'Easy A'}
{'favorite': 'True', 'title': 'Dinner for SCHMUCKS'}
{'favorite': 'False', 'title': 'Ghostbusters'}
{'favorite': 'True', 'title': 'Robin Hood: Prince of Thieves'}

Fix the '2' in Back 2 the Future. That should be a find-and-replace problem. Write code to find the movie with the title 'Back 2 the Future' and save it as a variable:

b2tf = root.find("./genre/decade/movie[@title='Back 2 the Future']")
print(b2tf)

<Element 'movie' at 0x10ce00ef8>

Notice that the .find() method returns an element of the tree: the first one that matches, or None if nothing matches. Much of the time, it is more useful to edit the content within an element.

Modify the title attribute of the Back 2 the Future element variable to read "Back to the Future". Then, print out the attributes of your variable to see your change. You can easily do this by accessing the attribute of the element and then assigning a new value to it:

b2tf.attrib["title"] = "Back to the Future"
print(b2tf.attrib)

{'favorite': 'False', 'title': 'Back to the Future'}

Write your changes back to the XML so they are permanently fixed in the document. Use the .write() method to do this, then print out your movie attributes again to make sure your changes worked:

tree.write("movies.xml")

tree = ET.parse('movies.xml')
root = tree.getroot()

for movie in root.iter('movie'):
    print(movie.attrib)

{'favorite': 'True', 'title': 'Indiana Jones: The raiders of the lost Ark'}
{'favorite': 'True', 'title': 'THE KARATE KID'}
{'favorite': 'False', 'title': 'Back to the Future'}
{'favorite': 'False', 'title': 'X-Men'}
{'favorite': 'True', 'title': 'Batman Returns'}
{'favorite': 'False', 'title': 'Reservoir Dogs'}
{'favorite': 'False', 'title': 'ALIEN'}
{'favorite': 'True', 'title': "Ferris Bueller's Day Off"}
{'favorite': 'FALSE', 'title': 'American Psycho'}
{'favorite': 'False', 'title': 'Batman: The Movie'}
{'favorite': 'True', 'title': 'Easy A'}
{'favorite': 'True', 'title': 'Dinner for SCHMUCKS'}
{'favorite': 'False', 'title': 'Ghostbusters'}
{'favorite': 'True', 'title': 'Robin Hood: Prince of Thieves'}
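While experimenting, you may prefer not to overwrite the original file. The sketch below writes to a temporary path instead and passes the encoding and xml_declaration arguments of .write() explicitly; the path, file name, and one-movie document are assumptions made for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical one-movie document with the title to fix
root = ET.fromstring('<collection><movie title="Back 2 the Future"/></collection>')
root.find("movie").set("title", "Back to the Future")

tree = ET.ElementTree(root)
# Write to a temporary location rather than the original file
out_path = os.path.join(tempfile.mkdtemp(), "movies_fixed.xml")
tree.write(out_path, encoding="utf-8", xml_declaration=True)

# Parse the file again to confirm the change was saved
reloaded = ET.parse(out_path).getroot()
print(reloaded.find("movie").get("title"))
```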

Fixing Attributes

The multiple attribute is incorrect in some places. Use ElementTree to fix the designator based on how many formats the movie comes in. First, print the format attribute and text to see which parts need to be fixed.

for form in root.findall("./genre/decade/movie/format"):
    print(form.attrib, form.text)

{'multiple': 'No'} DVD
{'multiple': 'Yes'} DVD,Online
{'multiple': 'False'} Blu-ray
{'multiple': 'Yes'} dvd, digital
{'multiple': 'No'} VHS
{'multiple': 'No'} Online
{'multiple': 'Yes'} DVD
{'multiple': 'No'} DVD
{'multiple': 'No'} blue-ray
{'multiple': 'Yes'} DVD,VHS
{'multiple': 'No'} DVD
{'multiple': 'Yes'} DVD,digital,Netflix
{'multiple': 'No'} Online,VHS
{'multiple': 'No'} Blu_Ray

There is some work that needs to be done on this tag.

You can use a regular expression to find commas: a comma in the text tells you that the multiple attribute should be "Yes", and no comma means "No". Adding and modifying attributes is easy with the .set() method.

Note: re is Python's standard regular expression module. If you want to know more about regular expressions, consider this tutorial.

import re

for form in root.findall("./genre/decade/movie/format"):
    # Search for the commas in the format text
    match = re.search(',',form.text)
    if match:
        form.set('multiple','Yes')
    else:
        form.set('multiple','No')

# Write out the tree to the file again
tree.write("movies.xml")

tree = ET.parse('movies.xml')
root = tree.getroot()

for form in root.findall("./genre/decade/movie/format"):
    print(form.attrib, form.text)

{'multiple': 'No'} DVD
{'multiple': 'Yes'} DVD,Online
{'multiple': 'No'} Blu-ray
{'multiple': 'Yes'} dvd, digital
{'multiple': 'No'} VHS
{'multiple': 'No'} Online
{'multiple': 'No'} DVD
{'multiple': 'No'} DVD
{'multiple': 'No'} blue-ray
{'multiple': 'Yes'} DVD,VHS
{'multiple': 'No'} DVD
{'multiple': 'Yes'} DVD,digital,Netflix
{'multiple': 'Yes'} Online,VHS
{'multiple': 'No'} Blu_Ray
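A regular expression is not the only way to make this decision. As an alternative sketch (not the approach used above), you could split the text on commas and count the non-empty pieces, which also tolerates a stray trailing comma. The sample values are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical format elements whose multiple attributes are all wrong
sample = ET.fromstring("""<collection>
  <format multiple="No">Online,VHS</format>
  <format multiple="Yes">DVD</format>
  <format multiple="No">dvd, digital</format>
</collection>""")

for form in sample.iter("format"):
    # Split on commas and drop empty pieces; guard against missing text
    parts = [p.strip() for p in (form.text or "").split(",") if p.strip()]
    form.set("multiple", "Yes" if len(parts) > 1 else "No")

flags = [form.get("multiple") for form in sample.iter("format")]
print(flags)
```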

Moving Elements

Some of the data has been placed in the wrong decade. Use what you have learned about XML and ElementTree to find and fix the decade data errors.

It will be useful to print out both the decade tags and the year tags throughout the document.

for decade in root.findall("./genre/decade"):
    print(decade.attrib)
    for year in decade.findall("./movie/year"):
        print(year.text, '\n')

{'years': '1980s'}
1981 

1984 

1985 

{'years': '1990s'}
2000 

1992 

1992 

{'years': '1970s'}
1979 

{'years': '1980s'}
1986 

2000 

{'years': '1960s'}
1966 

{'years': '2010s'}
2010 

2011 

{'years': '1980s'}
1984 

{'years': '1990s'}
1991

The two years that are in the wrong decade are the movies from the 2000s. Figure out which movies those are by using an XPath expression.

for movie in root.findall("./genre/decade/movie[year='2000']"):
    print(movie.attrib)

{'favorite': 'False', 'title': 'X-Men'}
{'favorite': 'FALSE', 'title': 'American Psycho'}
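Spotting misplaced years by eye works for a small file. For a larger one, the same check could be sketched in code by comparing each year with the decade label above it. The sample below is made up, and the slicing assumes every label looks like "1990s".

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with one misplaced movie
sample = ET.fromstring("""<collection>
  <genre category="Action">
    <decade years="1990s">
      <movie title="X-Men"><year>2000</year></movie>
      <movie title="Batman Returns"><year>1992</year></movie>
    </decade>
  </genre>
</collection>""")

misplaced = []
for decade in sample.iter("decade"):
    start = int(decade.get("years")[:4])  # assumes labels shaped like "1990s"
    for movie in decade.findall("movie"):
        year = int(movie.findtext("year"))
        # A movie belongs here only if its year falls inside the decade
        if not start <= year < start + 10:
            misplaced.append((movie.get("title"), year, decade.get("years")))
print(misplaced)
```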

You have to add a new decade tag, the 2000s, to the Action genre in order to move the X-Men data. The ET.SubElement() function creates this tag and appends it as the last child of the Action genre element.

action = root.find("./genre[@category='Action']")
new_dec = ET.SubElement(action, 'decade')
new_dec.attrib["years"] = '2000s'

print(ET.tostring(action, encoding='utf8').decode('utf8'))

<?xml version='1.0' encoding='utf8'?>
<genre category="Action">
        <decade years="1980s">
            <movie favorite="True" title="Indiana Jones: The raiders of the lost Ark">
                <format multiple="No">DVD</format>
                <year>1981</year>
                <rating>PG</rating>
                <description>
                'Archaeologist and adventurer Indiana Jones 
                is hired by the U.S. government to find the Ark of the 
                Covenant before the Nazis.'
                </description>
            </movie>
               <movie favorite="True" title="THE KARATE KID">
               <format multiple="Yes">DVD,Online</format>
               <year>1984</year>
               <rating>PG</rating>
               <description>None provided.</description>
            </movie>
            <movie favorite="False" title="Back to the Future">
               <format multiple="No">Blu-ray</format>
               <year>1985</year>
               <rating>PG</rating>
               <description>Marty McFly</description>
            </movie>
        </decade>
        <decade years="1990s">
            <movie favorite="False" title="X-Men">
               <format multiple="Yes">dvd, digital</format>
               <year>2000</year>
               <rating>PG-13</rating>
               <description>Two mutants come to a private academy for their kind whose resident superhero team must 
               oppose a terrorist organization with similar powers.</description>
            </movie>
            <movie favorite="True" title="Batman Returns">
               <format multiple="No">VHS</format>
               <year>1992</year>
               <rating>PG13</rating>
               <description>NA.</description>
            </movie>
               <movie favorite="False" title="Reservoir Dogs">
               <format multiple="No">Online</format>
               <year>1992</year>
               <rating>R</rating>
               <description>WhAtEvER I Want!!!?!</description>
            </movie>
        </decade>    
    <decade years="2000s" /></genre>

Now append the X-Men movie to the 2000s and remove it from the 1990s, using .append() and .remove(), respectively.

xmen = root.find("./genre/decade/movie[@title='X-Men']")
dec2000s = root.find("./genre[@category='Action']/decade[@years='2000s']")
dec2000s.append(xmen)
dec1990s = root.find("./genre[@category='Action']/decade[@years='1990s']")
dec1990s.remove(xmen)

print(ET.tostring(action, encoding='utf8').decode('utf8'))

<?xml version='1.0' encoding='utf8'?>
<genre category="Action">
        <decade years="1980s">
            <movie favorite="True" title="Indiana Jones: The raiders of the lost Ark">
                <format multiple="No">DVD</format>
                <year>1981</year>
                <rating>PG</rating>
                <description>
                'Archaeologist and adventurer Indiana Jones 
                is hired by the U.S. government to find the Ark of the 
                Covenant before the Nazis.'
                </description>
            </movie>
               <movie favorite="True" title="THE KARATE KID">
               <format multiple="Yes">DVD,Online</format>
               <year>1984</year>
               <rating>PG</rating>
               <description>None provided.</description>
            </movie>
            <movie favorite="False" title="Back to the Future">
               <format multiple="No">Blu-ray</format>
               <year>1985</year>
               <rating>PG</rating>
               <description>Marty McFly</description>
            </movie>
        </decade>
        <decade years="1990s">
            <movie favorite="True" title="Batman Returns">
               <format multiple="No">VHS</format>
               <year>1992</year>
               <rating>PG13</rating>
               <description>NA.</description>
            </movie>
               <movie favorite="False" title="Reservoir Dogs">
               <format multiple="No">Online</format>
               <year>1992</year>
               <rating>R</rating>
               <description>WhAtEvER I Want!!!?!</description>
            </movie>
        </decade>    
    <decade years="2000s"><movie favorite="False" title="X-Men">
               <format multiple="Yes">dvd, digital</format>
               <year>2000</year>
               <rating>PG-13</rating>
               <description>Two mutants come to a private academy for their kind whose resident superhero team must 
               oppose a terrorist organization with similar powers.</description>
            </movie>
            </decade></genre>
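The create, append, and remove steps can be bundled into one small helper if you expect to move more than one movie. The move_movie() function below is a hypothetical helper written for this sketch, and the Thriller sample is a made-up, trimmed-down fragment.

```python
import xml.etree.ElementTree as ET

def move_movie(genre, title, target_years):
    """Move the movie with the given title into the decade labelled target_years."""
    for decade in genre.findall("decade"):
        for movie in decade.findall("movie"):
            if movie.get("title") == title:
                target = genre.find(f"decade[@years='{target_years}']")
                if target is None:
                    # Create the decade at the end of the genre if it is missing
                    target = ET.SubElement(genre, "decade", {"years": target_years})
                decade.remove(movie)
                target.append(movie)
                return target
    return None  # no movie with that title in this genre

# Hypothetical genre fragment with one misplaced movie
genre = ET.fromstring("""<genre category="Thriller">
  <decade years="1980s">
    <movie title="American Psycho"><year>2000</year></movie>
  </decade>
</genre>""")
new_home = move_movie(genre, "American Psycho", "2000s")
print(ET.tostring(genre, encoding="unicode"))
```

The helper matches titles with plain attribute comparison, so titles containing quotes do not need escaping inside an XPath expression.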

Build XML Documents

Nice, so you were able to essentially move an entire movie to a new decade. Save your changes back to the XML.

tree.write("movies.xml")

tree = ET.parse('movies.xml')
root = tree.getroot()

print(ET.tostring(root, encoding='utf8').decode('utf8'))

<?xml version='1.0' encoding='utf8'?>
<collection>
    <genre category="Action">
        <decade years="1980s">
            <movie favorite="True" title="Indiana Jones: The raiders of the lost Ark">
                <format multiple="No">DVD</format>
                <year>1981</year>
                <rating>PG</rating>
                <description>
                'Archaeologist and adventurer Indiana Jones 
                is hired by the U.S. government to find the Ark of the 
                Covenant before the Nazis.'
                </description>
            </movie>
               <movie favorite="True" title="THE KARATE KID">
               <format multiple="Yes">DVD,Online</format>
               <year>1984</year>
               <rating>PG</rating>
               <description>None provided.</description>
            </movie>
            <movie favorite="False" title="Back to the Future">
               <format multiple="No">Blu-ray</format>
               <year>1985</year>
               <rating>PG</rating>
               <description>Marty McFly</description>
            </movie>
        </decade>
        <decade years="1990s">
            <movie favorite="True" title="Batman Returns">
               <format multiple="No">VHS</format>
               <year>1992</year>
               <rating>PG13</rating>
               <description>NA.</description>
            </movie>
               <movie favorite="False" title="Reservoir Dogs">
               <format multiple="No">Online</format>
               <year>1992</year>
               <rating>R</rating>
               <description>WhAtEvER I Want!!!?!</description>
            </movie>
        </decade>    
    <decade years="2000s"><movie favorite="False" title="X-Men">
               <format multiple="Yes">dvd, digital</format>
               <year>2000</year>
               <rating>PG-13</rating>
               <description>Two mutants come to a private academy for their kind whose resident superhero team must 
               oppose a terrorist organization with similar powers.</description>
            </movie>
            </decade></genre>

    <genre category="Thriller">
        <decade years="1970s">
            <movie favorite="False" title="ALIEN">
                <format multiple="No">DVD</format>
                <year>1979</year>
                <rating>R</rating>
                <description>"""""""""</description>
            </movie>
        </decade>
        <decade years="1980s">
            <movie favorite="True" title="Ferris Bueller's Day Off">
                <format multiple="No">DVD</format>
                <year>1986</year>
                <rating>PG13</rating>
                <description>Funny movie about a funny guy</description>
            </movie>
            <movie favorite="FALSE" title="American Psycho">
                <format multiple="No">blue-ray</format>
                <year>2000</year>
                <rating>Unrated</rating>
                <description>psychopathic Bateman</description>
            </movie>
        </decade>
    </genre>

    <genre category="Comedy">
        <decade years="1960s">
            <movie favorite="False" title="Batman: The Movie">
                <format multiple="Yes">DVD,VHS</format>
                <year>1966</year>
                <rating>PG</rating>
                <description>What a joke!</description>
            </movie>
        </decade>
        <decade years="2010s">
            <movie favorite="True" title="Easy A">
                <format multiple="No">DVD</format>
                <year>2010</year>
                <rating>PG--13</rating>
                <description>Emma Stone = Hester Prynne</description>
            </movie>
            <movie favorite="True" title="Dinner for SCHMUCKS">
                <format multiple="Yes">DVD,digital,Netflix</format>
                <year>2011</year>
                <rating>Unrated</rating>
                <description>Tim (Rudd) is a rising executive
                 who “succeeds” in finding the perfect guest, 
                 IRS employee Barry (Carell), for his boss’ monthly event, 
                 a so-called “dinner for idiots,” which offers certain 
                 advantages to the exec who shows up with the biggest buffoon.
                 </description>
            </movie>
        </decade>
        <decade years="1980s">
            <movie favorite="False" title="Ghostbusters">
                <format multiple="Yes">Online,VHS</format>
                <year>1984</year>
                <rating>PG</rating>
                <description>Who ya gonna call?</description>
            </movie>
        </decade>
        <decade years="1990s">
            <movie favorite="True" title="Robin Hood: Prince of Thieves">
                <format multiple="No">Blu_Ray</format>
                <year>1991</year>
                <rating>Unknown</rating>
                <description>Robin Hood slaying</description>
            </movie>
        </decade>
    </genre>
</collection>

What's New in ElementTree?

Here is an overview of the new features and improvements in the ElementTree library in recent Python versions:

1. XPath support (Python 3.8): ElementTree supports a limited subset of XPath through find() and findall(), not the full XPath 1.0 language. The subset covers tag paths, attribute predicates, and child-text predicates, and Python 3.8 extended it with the {*} namespace wildcard. This is enough for fairly rich XML queries. For example:

# Finding all movies with a specific attribute using XPath
for movie in root.findall(".//movie[@favorite='True']"):
    print(movie.attrib)
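
To try this kind of query without the full movies file, here is a minimal self-contained sketch. The inline XML string and the variable names are made up for illustration. Only ET.fromstring() and findall() come from ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample (not the tutorial's full file)
SAMPLE = """
<collection>
    <genre category="Action">
        <movie favorite="True" title="Batman Returns"><year>1992</year></movie>
        <movie favorite="False" title="Reservoir Dogs"><year>1992</year></movie>
        <movie favorite="False" title="X-Men"><year>2000</year></movie>
    </genre>
</collection>
"""

root = ET.fromstring(SAMPLE)

# Attribute predicate: movies whose favorite attribute equals 'True'
favorites = [m.get("title") for m in root.findall(".//movie[@favorite='True']")]

# Child-text predicate: movies that have a <year> child with text '1992'
from_1992 = [m.get("title") for m in root.findall(".//movie[year='1992']")]

print(favorites)
print(from_1992)
```

Both predicates belong to the XPath subset that ElementTree supports, so no third-party library is needed for queries like these.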

2. Improved namespaces (Python 3.8+): Namespace handling is simpler for XML files that use prefixed or default namespaces. Note that register_namespace() only sets the prefix used when the tree is serialized, so searches still have to name the namespace. For example:

# Register the prefix used for serialization, then search with the
# namespace URI written in {uri}tag form
ET.register_namespace('', 'http://example.com/namespace')
movies = root.findall(".//{http://example.com/namespace}movie")
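
As a rough illustration of two alternatives to spelling out the URI in every path, the sketch below uses a prefix map and the {*} wildcard. The sample document and the prefix "m" are hypothetical, and the URI is the same placeholder used in the snippet above.

```python
import xml.etree.ElementTree as ET

NS_URI = "http://example.com/namespace"  # placeholder URI, not a real schema
SAMPLE = (
    f'<collection xmlns="{NS_URI}">'
    '<movie title="ALIEN"/><movie title="Easy A"/>'
    "</collection>"
)

root = ET.fromstring(SAMPLE)

# Option 1: map a prefix of your choosing to the URI and pass it to findall()
ns = {"m": NS_URI}
by_prefix = [m.get("title") for m in root.findall(".//m:movie", ns)]

# Option 2 (Python 3.8+): the {*} wildcard matches the tag in any namespace
by_wildcard = [m.get("title") for m in root.findall(".//{*}movie")]

print(by_prefix)
print(by_wildcard)
```

The prefix in the map does not have to match any prefix used in the file itself. It only needs to point at the same URI.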

3. Parser improvements (Python 3.9): Better parse error messages make it easier to debug malformed XML files.

4. New indent() function (Python 3.9): The xml.etree.ElementTree.indent() function was added to pretty-print XML documents by indenting their elements. For example:

ET.indent(root, space="  ", level=0)
ET.dump(root)
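
A small self-contained sketch of the same call, assuming Python 3.9 or later. The one-line input document is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical single-line document with no whitespace between elements
root = ET.fromstring(
    '<collection><genre category="Action"><movie title="X-Men"/></genre></collection>'
)

# indent() modifies the tree in place by inserting whitespace text nodes
ET.indent(root, space="  ")

pretty = ET.tostring(root, encoding="unicode")
print(pretty)
```

Because indent() changes the tree itself, call it just before serializing rather than before code that inspects element text or tail values.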

5. Efficient parsing with iterparse (Python 3.10): iterparse() processes a document incrementally instead of loading it all at once, which keeps memory use low and is especially useful for large XML files.
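
The following sketch shows one common way to use iterparse() on a stream. The in-memory bytes buffer stands in for a large file, and the dictionary name is an assumption made for the example.

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for a large file on disk; iterparse() also accepts a filename
xml_bytes = io.BytesIO(
    b"<collection>"
    b"<movie title='ALIEN'><year>1979</year></movie>"
    b"<movie title='Ghostbusters'><year>1984</year></movie>"
    b"</collection>"
)

titles_by_year = {}

# "end" events fire once an element and all of its children have been read
for event, elem in ET.iterparse(xml_bytes, events=("end",)):
    if elem.tag == "movie":
        titles_by_year[elem.findtext("year")] = elem.get("title")
        elem.clear()  # discard the element's contents to keep memory use low

print(titles_by_year)
```

Clearing each movie element after it has been handled is what keeps memory flat. Without it, the full tree is still built as parsing proceeds.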

6. Expanded documentation (ongoing updates): The Python documentation for ElementTree is now more comprehensive and covers best practices and advanced use cases.

Deprecated ElementTree Features and Their Replacements

1. write() with xml_declaration in Python 3.8+: Passing the xml_declaration parameter to write() is discouraged when the encoding is set to 'unicode'.

Replacement: Use xml_declaration only when the encoding is explicitly set to something other than 'unicode'.

tree.write("output.xml", encoding="utf-8", xml_declaration=True)
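
To see what this call produces without touching the disk, the sketch below writes to an in-memory buffer instead of "output.xml". The sample element and the buffer are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

root = ET.fromstring("<collection><movie title='Easy A'/></collection>")
tree = ET.ElementTree(root)

buffer = io.BytesIO()  # in-memory stand-in for "output.xml"
tree.write(buffer, encoding="utf-8", xml_declaration=True)

output = buffer.getvalue().decode("utf-8")
print(output)  # the first line is the XML declaration
```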

2. HTML parsing: Although not formally deprecated, using ElementTree to parse HTML is discouraged because it copes poorly with HTML that is not well-formed.

Replacement: Use libraries designed specifically for parsing HTML, such as BeautifulSoup from the bs4 package.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')  # html_content: the HTML source as a string

3. Namespace workarounds: Older approaches that handle namespaces by hand (for example, concatenating the namespace URI with the tag name) are less recommended now that newer versions have robust namespace support.

Replacement: Use the built-in namespace-aware methods and functions.

ns = {'m': 'http://example.com/namespace'}  # prefix map passed to the search
movies = root.findall(".//m:movie", namespaces=ns)

4. Manual pretty-printing: Hand-written techniques for indenting and formatting XML are obsolete now that indent() is available (Python 3.9).

Replacement: Use ET.indent() to format XML automatically.

ET.indent(root, space=" ")

5. Direct use of _ElementInterface: Internal classes such as _ElementInterface are not designed for direct use and may break in future versions.

Replacement: Always work through the documented public API of the ElementTree library.

Conclusion

There are a few important things to remember about XML and about using ElementTree.

Tags build the tree structure and designate which values are delineated there. Smart structuring makes an XML document easy to read and write. Tags always need opening and closing brackets to show the parent and child relationships.

Attributes further describe how to validate a tag or allow for boolean designations. Attributes typically take very specific values so that the XML parser (and the user) can use them to check the tag values.

ElementTree is an important Python library that lets you parse and navigate XML documents. It breaks an XML document down into a tree structure that is easy to work with. When in doubt, print it out (print(ET.tostring(root, encoding='utf8').decode('utf8'))). This handy print statement shows the entire XML document at once, which helps you check your work when editing, adding to, or removing from an XML document.

Now you are equipped to understand XML and begin parsing!