NLP

Published: 2021-02-04


How to get the wikipedia corpus text with digits by using gensim wikicorpus?

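What you'll learn in this example:

How to keep digits when training word2vec

What you need before you start:

A rough familiarity with the word2vec training workflow
Know how to work from a Wikipedia dump file

Versions
- python 3.6.8
- gensim==3.8.3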

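When training a word2vec model (w2v below) on Chinese text, the first-choice corpus is usually Wikipedia (3,651,160 articles the last time I checked). gensim can process the Wikipedia bz2 dump directly and produce text in the format training needs, but it preprocesses the text first: it strips punctuation and every digit. What if I want to keep them? Think of years (e.g. 1992), product models (iphone12), or technical terms (4G, 5G). With Wikipedia as the corpus, the default pipeline removes digits, so let's see what can be done!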

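1. A closer look at WikiCorpus

Most tutorials online look much the same, and few point out at which stage the text actually gets filtered, so let's take the whole pipeline apart:

If you have trained w2v before, you have almost certainly seen this line: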

wiki_corpus = WikiCorpus(your_dump_bz2_file, dictionary={})

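This feeds the downloaded Wikipedia dump straight into the WikiCorpus class for preprocessing, which turns an article into something like this:

Before: 今天天氣 , 不好 2021 星期一氣溫 15度 , 下雨 , 機率 100%
After:  今天天氣不好星期一氣溫度下雨機率

The digits and punctuation in the example above are clearly gone!

The official documentation does not offer a flag that simply keeps digits; only lowercasing can be switched directly with a parameter. Otherwise I wouldn't be writing this post.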

https://radimrehurek.com/gensim/corpora/wikicorpus.html#gensim.corpora.wikicorpus.WikiCorpus

Reading the official source code shows where this happens: WikiCorpus calls utils.tokenize from utils.py, and that function contains the regex that makes the evidence disappear. Let's look at how it works.

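2. A closer look at utils.py

We can ignore the other functions in this file and focus on the regex handling. Let's start by calling it directly and see what happens:

The original behaviour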

from gensim import utils
content = '今天 天氣 , 不好 2021 星期一 氣溫 15度 , 下雨 , 機率 100%'
[token.encode('utf8') for token in utils.tokenize(content, lower=True, errors='ignore')
            if len(token) <= 15 and not token.startswith('_')]

Output; you can see that the digits and punctuation are gone:

[b'\xe4\xbb\x8a\xe5\xa4\xa9', b'\xe5\xa4\xa9\xe6\xb0\xa3', b'\xe4\xb8\x8d\xe5\xa5\xbd', b'\xe6\x98\x9f\xe6\x9c\x9f\xe4\xb8\x80', b'\xe6\xb0\xa3\xe6\xba\xab', b'\xe5\xba\xa6', b'\xe4\xb8\x8b\xe9\x9b\xa8', b'\xe6\xa9\x9f\xe7\x8e\x87']

Method 1

A small rewrite that does not use utils.tokenize

from gensim import utils
content = '今天 天氣 , 不好 2021 星期一 氣溫 15度 , 下雨 , 機率 100%'
[token.encode('utf8') for token in content.split() 
           if len(token) <= 15 and not token.startswith('_')]

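Output; everything is kept, but then everything has to be cleaned up by hand, which is tedious, so I won't use this approach. It does confirm that utils.tokenize is where the problem lies: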

[b'\xe4\xbb\x8a\xe5\xa4\xa9', b'\xe5\xa4\xa9\xe6\xb0\xa3', b',', b'\xe4\xb8\x8d\xe5\xa5\xbd', b'2021', b'\xe6\x98\x9f\xe6\x9c\x9f\xe4\xb8\x80', b'\xe6\xb0\xa3\xe6\xba\xab', b'15\xe5\xba\xa6', b',', b'\xe4\xb8\x8b\xe9\x9b\xa8', b',', b'\xe6\xa9\x9f\xe7\x8e\x87', b'100%']
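To illustrate what "handling everything yourself" would involve in Method 1, here is a minimal sketch that splits on whitespace and drops tokens made only of punctuation. The helper name split_keep_digits and its max_len parameter are assumptions for illustration, not part of gensim.

```python
import re

# Matches any single word character (letter, CJK character, digit, underscore).
HAS_WORD_CHAR = re.compile(r'\w', re.UNICODE)

def split_keep_digits(content, max_len=15):
    """Whitespace split that keeps digits but drops pure-punctuation tokens."""
    tokens = []
    for token in content.split():
        if not HAS_WORD_CHAR.search(token):
            continue  # token is punctuation only, e.g. ','
        if len(token) <= max_len and not token.startswith('_'):
            tokens.append(token)
    return tokens

content = '今天 天氣 , 不好 2021 星期一 氣溫 15度 , 下雨 , 機率 100%'
print(split_keep_digits(content))
```

Punctuation attached to a token (the % in 100%) still survives here, which is one of the details that would need extra handling with this approach.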

Method 2

Edit utils.py directly; you can find the file under the package path

/anaconda/lib/python3.6/site-packages/gensim/utils.py

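Open the file and you'll find a tokenize function (I removed its docstring because it is long). After the text is lowercased, it runs PAT_ALPHABETIC.finditer(text); searching for PAT_ALPHABETIC finally reveals the root of all evil!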

def tokenize(text, lowercase=False, deacc=False, errors="strict", to_lower=False, lower=False):

    lowercase = lowercase or to_lower or lower
    text = to_unicode(text, errors=errors)
    if lowercase:
        text = text.lower()
    if deacc:
        text = deaccent(text)
    for match in PAT_ALPHABETIC.finditer(text):
        yield match.group()

The default regex is written like this:

PAT_ALPHABETIC = re.compile('(((?![\d])\w)+)', re.UNICODE)

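The (?![\d]) part is a negative lookahead that refuses to match any digit character, so digits never end up in a token. I rewrote it as follows (the leftover empty group () matches nothing, so the pattern behaves like \w+); adjust it to your needs: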

PAT_ALPHABETIC = re.compile('((()\w)+)', re.UNICODE)
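The difference between the two patterns can be checked without gensim. The sketch below copies the default pattern quoted above and compares it with \w+, which is used here as a simplified stand-in for the modified pattern; the names DEFAULT_PAT, KEEP_DIGITS_PAT and find_tokens are made up for this illustration.

```python
import re

# Default pattern as quoted from gensim's utils.py: word characters except digits.
DEFAULT_PAT = re.compile(r'(((?![\d])\w)+)', re.UNICODE)
# Simplified stand-in for the modified pattern: any run of word characters.
KEEP_DIGITS_PAT = re.compile(r'\w+', re.UNICODE)

def find_tokens(pattern, text):
    """Return every full match of `pattern` in `text`, in order."""
    return [match.group() for match in pattern.finditer(text)]

sample = 'iphone12 4G 1992 15度'
print(find_tokens(DEFAULT_PAT, sample))      # digits are cut out of every token
print(find_tokens(KEEP_DIGITS_PAT, sample))  # tokens stay intact
```

With the default pattern a token such as iphone12 is truncated at its first digit, and a purely numeric token such as 1992 disappears entirely.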

Let's run it once more on the same trivial example; please ignore how ugly the code is:

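import re

content = '今天 天氣 , 不好 2021 星期一 氣溫 15度 , 下雨 , 機率 100%'
# Same pattern as the patched utils.py: digits are no longer excluded.
PAT_ALPHABETIC = re.compile(r'((()\w)+)', re.UNICODE)

def simple_tokenize(content):
    # Yield every run of word characters (letters, CJK characters, digits, underscore).
    for match in PAT_ALPHABETIC.finditer(content):
        yield match.group()

# Join the tokens with single spaces for display.
result = ' '.join(simple_tokenize(content))

print(result)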

Output; this is what I had in mind:

今天 天氣 不好 2021 星期一 氣溫 15度 下雨 機率 100

That was only our own small test, though. How do we apply it to real training? Let's keep going.

3. Override and subclass

Now that we roughly know how it works, my approach is to copy utils.py into my project directory and change:

PAT_ALPHABETIC = re.compile('(((?![\d])\w)+)', re.UNICODE)

to

PAT_ALPHABETIC = re.compile('((()\w)+)', re.UNICODE)

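Then, where the corpus is built, define a new class that inherits from the original WikiCorpus, and redefine the two functions tokenize and process_article so that they call the local utils; after that it is ready to use. The example below is essentially the original source, so nothing else needs changing:

Remember to import your own utils.py after gensim!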

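import bz2
import logging
import multiprocessing

# Brings in WikiCorpus, extract_pages, filter_wiki, ARTICLE_MIN_WORDS, IGNORED_NAMESPACES
from gensim.corpora.wikicorpus import *
# Local patched copy of utils.py; must be imported after gensim so it replaces the gensim utils name
import utils

logger = logging.getLogger(__name__)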

def tokenize(content):
    """
    Tokenize a piece of text from wikipedia. The input string `content` is assumed
    to be mark-up free (see `filter_wiki()`).

    Return list of tokens as unicode strings. Ignore words shorter than 2 or longer
    than 15 characters (not bytes!).
    """
    # TODO maybe ignore tokens with non-latin characters? (no chinese, arabic, russian etc.)
    return [
        utils.to_unicode(token) for token in utils.tokenize(content, lower=True, errors='ignore')
        if 2 <= len(token) <= 15 and not token.startswith('_')
    ]

def process_article(args):
    """
    Parse a wikipedia article, returning its content as a list of tokens
    (utf8-encoded strings).
    """
    text, lemmatize, title, pageid = args
    text = filter_wiki(text)
    if lemmatize:
        result = utils.lemmatize(text)
    else:
        result = tokenize(text)
    return result, title, pageid

class MyWikiCorpus(WikiCorpus):
    def __init__(self, fname, processes=None, lemmatize=utils.has_pattern(), dictionary=None, filter_namespaces=('0',)):
        WikiCorpus.__init__(self, fname, processes, lemmatize, dictionary, filter_namespaces)
    def get_texts(self):
        articles, articles_all = 0, 0
        positions, positions_all = 0, 0
        texts = ((text, self.lemmatize, title, pageid) for title, text, pageid in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces))
        pool = multiprocessing.Pool(self.processes)
        # process the corpus in smaller chunks of docs, because multiprocessing.Pool
        # is dumb and would load the entire input into RAM at once...
        for group in utils.chunkize(texts, chunksize=10 * self.processes, maxsize=1):
            for tokens, title, pageid in pool.imap(process_article, group):  # chunksize=10):
                articles_all += 1
                positions_all += len(tokens)
                # article redirects and short stubs are pruned here
                if len(tokens) < ARTICLE_MIN_WORDS or any(title.startswith(ignore + ':') for ignore in IGNORED_NAMESPACES):
                    continue
                articles += 1
                positions += len(tokens)
                if self.metadata:
                    yield (tokens, (pageid, title))
                else:
                    yield tokens
        pool.terminate()
        logger.info(
            "finished iterating over Wikipedia corpus of %i documents with %i positions"
            " (total %i articles, %i positions before pruning articles shorter than %i words)",
            articles, positions, articles_all, positions_all, ARTICLE_MIN_WORDS)
        self.length = articles  # cache corpus length

The only thing left to change is where the corpus is created. Replace:

wiki_corpus = WikiCorpus(your_dump_bz2_file, dictionary={})

with MyWikiCorpus, and that's it:

wiki_corpus = MyWikiCorpus(your_dump_bz2_file, dictionary={})
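As a possible lighter alternative, the WikiCorpus documentation linked in this post also lists a tokenizer_func argument that receives (text, token_min_len, token_max_len, lower). I have not tested it in this post, so treat the wiring in the final comment as an assumption; keep_digits_tokenizer and KEEP_DIGITS_PAT are hypothetical names. The tokenizer itself depends only on re:

```python
import re

# Any run of word characters, digits included.
KEEP_DIGITS_PAT = re.compile(r'\w+', re.UNICODE)

def keep_digits_tokenizer(content, token_min_len=2, token_max_len=15, lower=True):
    """Tokenize markup-free text while keeping digits; mirrors the length filter used above."""
    if lower:
        content = content.lower()
    tokens = (match.group() for match in KEEP_DIGITS_PAT.finditer(content))
    return [
        token for token in tokens
        if token_min_len <= len(token) <= token_max_len and not token.startswith('_')
    ]

print(keep_digits_tokenizer('iPhone12 於 2020 年 發表 , 支援 5G'))

# Assumed wiring, not verified here:
# wiki_corpus = WikiCorpus(your_dump_bz2_file, dictionary={}, tokenizer_func=keep_digits_tokenizer)
```

Because WikiCorpus processes articles in a multiprocessing pool, such a function would need to be defined at module level so it can be pickled. Note that the minimum length of 2 drops single-character tokens such as 於 and 年.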

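References

The references below do not explain how to keep digits directly; they cover keeping punctuation. To keep digits, use the utils.py rewrite described above!

gensim package documentation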

https://radimrehurek.com/gensim/corpora/wikicorpus.html#gensim.corpora.wikicorpus.WikiCorpus

Fixing punctuation being removed (digits are still removed)

https://stackoverflow.com/questions/50697092/how-to-get-the-wikipedia-corpus-text-with-punctuation-by-using-gensim-wikicorpus

Complete example code (digits are still removed)

https://github.com/RaRe-Technologies/gensim/issues/552#issuecomment-278036501


When reposting or quoting, please credit the author: Happy Coding Lab, NLP series: How to make WikiCorpus keep digits when training word2vec?
