How to remove/parse UTM tags from URLs to get a clean URL in Python

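In this example you will learn:

How to get a clean URL with Python
How to use urllib.parse to remove unwanted tags/parameters/query strings

Before you start, you need:

Python 3.6 or later (the version this example was tested on)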

1. What is UTM?

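When working on URL-related code, you have probably run into this annoyance: several links all lead to the same page, yet the URLs themselves differ because various source-tracking codes have been appended to them to track site traffic. The most common are Google Analytics UTM parameters and Facebook's tracking tags. As a result, the same page gets counted more than once when you calculate traffic: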

https://www.google.com/?utm_source=123&utm_medium=456&fbclid=789#fb_comment_id=0987

https://www.google.com/?utm_source=123&utm_medium=456&fbclid=789

https://www.google.com/?utm_source=123&utm_medium=456

All three examples above lead to the same page, but because the URLs differ, the traffic counts come out wrong.

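What are these tags for?

UTM tags are custom link tags that Google Analytics lets analysts define and write into links.

Put simply, they let you track where users come from: which page, which button, or even which ad. Whenever a user clicks a button or link that I have tagged with UTM parameters, the URL carries that information back to the analytics backend, so I can easily split my site traffic by source: organic search, email, ad campaign pages, and so on.

Common tags include utm_source, utm_medium, and utm_campaign. See the references for more background.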
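To make the idea concrete, here is a minimal sketch of what a tagged link looks like when it is built. The campaign values (newsletter, email, spring_sale) are made up for illustration and do not come from any real campaign.

```python
from urllib.parse import urlencode

# Hypothetical campaign values, chosen only for illustration.
utm_params = {
    "utm_source": "newsletter",    # where the visitor came from
    "utm_medium": "email",         # the marketing channel
    "utm_campaign": "spring_sale", # the campaign name
}

base_url = "https://www.google.com/"
tagged_url = base_url + "?" + urlencode(utm_params)
print(tagged_url)
# https://www.google.com/?utm_source=newsletter&utm_medium=email&utm_campaign=spring_sale
```

The page itself is unchanged; only the query string differs, which is why tagged and untagged links end up being treated as different URLs.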

2. urllib.parse

We can use Python's built-in urllib.parse module to parse the parameters in a URL (some sources call them the query or tags).

A simple import:

from urllib.parse import urlparse, urlunparse, parse_qs, urlencode

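Parse the URL with urlparse(url)

As the result below shows, urlparse splits the URL into separate parts: scheme, netloc, path, params, query, and fragment. The URL we actually want is the front part (scheme, netloc, and path); the trailing params, query, and fragment are the parts to remove.

# Use the example URL from earlier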

url = 'https://www.google.com/?utm_source=123&utm_medium=456&fbclid=789#fb_comment_id=0987'
url_component = urlparse(url)

# Result
>>> url_component
ParseResult(scheme='https', netloc='www.google.com', path='/', params='', query='utm_source=123&utm_medium=456&fbclid=789', fragment='fb_comment_id=0987')
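Since ParseResult is a named tuple, each part can also be read by attribute name. The short sketch below uses the same example URL; the variable names parts and without_query are only illustrative.

```python
from urllib.parse import urlparse

url = 'https://www.google.com/?utm_source=123&utm_medium=456&fbclid=789#fb_comment_id=0987'
parts = urlparse(url)

print(parts.scheme)    # https
print(parts.netloc)    # www.google.com
print(parts.path)      # /
print(parts.query)     # utm_source=123&utm_medium=456&fbclid=789
print(parts.fragment)  # fb_comment_id=0987

# A named tuple is immutable, so _replace() returns a modified copy
# instead of changing the original object.
without_query = parts._replace(query='')
print(without_query.geturl())  # https://www.google.com/#fb_comment_id=0987
```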

Define the tags, query parameters, or keywords you want to remove

stripKeys=["utm_source","utm_medium","utm_campaign","utm_term","utm_content","fbclid","fb_comment_id","ref","fb_action_ids","fb_action_types","fb_source","fb_ref","fb_node","action_object_map","action_type_map","action_ref_map","from","_fstview","openExternalBrowser","_gl","time","utm_compaign","gclid","ct","redirect"]

Parse the query with parse_qs and remove the unwanted parameters

This demonstration removes only the query part (url_component.query)

query = parse_qs(url_component.query, keep_blank_values=True)

for key in stripKeys:
    query.pop(key, None)

# query is now empty
>>> query
{}
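The example query ends up empty because every parameter in it is a tracking tag. As a rough sketch of the more common mixed case, the snippet below uses a made-up query string in which id and q are hypothetical parameters that should survive. It also shows why keep_blank_values=True matters: without it, parse_qs silently drops parameters that have an empty value.

```python
from urllib.parse import parse_qs

# Hypothetical query string: "id" and "q" are made-up parameters we want to keep.
raw_query = "id=42&utm_source=123&q=&fbclid=789"

# By default, parameters with a blank value ("q=") are dropped.
print(parse_qs(raw_query))
# {'id': ['42'], 'utm_source': ['123'], 'fbclid': ['789']}

# With keep_blank_values=True they are kept; every value is a list of strings.
query = parse_qs(raw_query, keep_blank_values=True)
print(query)
# {'id': ['42'], 'utm_source': ['123'], 'q': [''], 'fbclid': ['789']}

# pop(key, None) removes the key if present and does nothing otherwise.
for key in ("utm_source", "fbclid"):
    query.pop(key, None)

print(query)
# {'id': ['42'], 'q': ['']}
```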

Finally, replace the original query

# doseq=True expands the value lists returned by parse_qs back into key=value pairs
url_component = url_component._replace(query=urlencode(query, doseq=True))
print(urlunparse(url_component))

Final result

https://www.google.com/#fb_comment_id=0987

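You can clearly see that only the Facebook fragment is still attached to the URL (that part is left as an exercise for you), while the UTM and fbclid parameters have all been removed. Pretty easy, right?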
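One possible way to approach the fragment exercise is sketched below. It is not the only solution. The helper name strip_tracking_fragment is hypothetical, and it assumes a tracking fragment is formatted like a query string (key=value pairs). It drops the fragment only when every key in it is on the strip list, so an ordinary anchor such as #section-2 is left alone.

```python
from urllib.parse import urlparse, urlunparse, parse_qsl

def strip_tracking_fragment(url, strip_keys):
    """Drop the fragment when it consists only of tracking keys (hypothetical helper)."""
    parts = urlparse(url)
    # Treat the fragment as if it were a query string.
    # A plain anchor like "section-2" parses to [('section-2', '')].
    pairs = parse_qsl(parts.fragment, keep_blank_values=True)
    if pairs and all(key in strip_keys for key, _ in pairs):
        parts = parts._replace(fragment='')
    return urlunparse(parts)

strip_keys = {"fb_comment_id"}  # shortened list, for illustration only

print(strip_tracking_fragment('https://www.google.com/#fb_comment_id=0987', strip_keys))
# https://www.google.com/
print(strip_tracking_fragment('https://www.google.com/#section-2', strip_keys))
# https://www.google.com/#section-2
```

If fragments never matter for your traffic counts, calling _replace(fragment='') unconditionally is simpler, at the cost of also removing legitimate anchors.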

3. Complete example


from urllib.parse import urlparse, urlunparse, parse_qs, urlencode

url = 'https://www.google.com/?utm_source=123&utm_medium=456&fbclid=789#fb_comment_id=0987'

url_component = urlparse(url)

stripKeys=["utm_source","utm_medium","utm_campaign","utm_term","utm_content","fbclid","fb_comment_id","ref","fb_action_ids","fb_action_types","fb_source","fb_ref","fb_node","action_object_map","action_type_map","action_ref_map","from","_fstview","openExternalBrowser","_gl","time","utm_compaign","gclid","ct","redirect"]

query = parse_qs(url_component.query, keep_blank_values=True)

for key in stripKeys:
    query.pop(key, None)

# doseq=True expands the value lists returned by parse_qs back into key=value pairs
url_component = url_component._replace(query=urlencode(query, doseq=True))
print(urlunparse(url_component))
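The same steps can be wrapped in a reusable function. What follows is a minimal sketch, not a definitive implementation: the names clean_url and STRIP_KEYS are hypothetical, the key list is shortened for readability, and the function also discards the fragment on the assumption that it is not needed to identify a page.

```python
from urllib.parse import urlparse, urlunparse, parse_qs, urlencode

# Shortened list for illustration; extend it with the keys you need.
STRIP_KEYS = {"utm_source", "utm_medium", "utm_campaign", "utm_term",
              "utm_content", "fbclid", "gclid"}

def clean_url(url, strip_keys=STRIP_KEYS):
    """Return url without tracking parameters and without its fragment."""
    parts = urlparse(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    # Keep only the parameters that are not on the strip list.
    kept = {key: values for key, values in query.items() if key not in strip_keys}
    # Assumption: the fragment is dropped entirely.
    parts = parts._replace(query=urlencode(kept, doseq=True), fragment='')
    return urlunparse(parts)

urls = [
    'https://www.google.com/?utm_source=123&utm_medium=456&fbclid=789#fb_comment_id=0987',
    'https://www.google.com/?utm_source=123&utm_medium=456&fbclid=789',
    'https://www.google.com/?utm_source=123&utm_medium=456',
]

# All three example URLs collapse to a single clean URL.
print({clean_url(u) for u in urls})
# {'https://www.google.com/'}
```

Using a set for the strip list keeps each membership check cheap, and building a filtered dictionary avoids mutating the parsed query while deciding what to keep.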