Python: list.count() vs. the Counter class

Using the Counter class
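A colleague handed me a pile of text data and asked for a small favor: count how often each word appears so he could see what the text mentions most, then use that for later analysis.

It's just word frequency, right? I don't do this often, but off the top of my head: segment the text, remove stop words, put every remaining word into one list, count, done.

So I wrote the code in five minutes, and for the counting step I used the list's count() method. I'm not too proud to show it, so here it is...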

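import jieba
import pandas as pd

count_word = []
stop_words = []
# Read the corpus and the stop-word list (one stop word per line).
with open('data/count.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
with open('data/stop_words_new.txt', 'r', encoding='utf-8') as f:
    stop_lines = f.readlines()
for word in stop_lines:
    stop_words.append(word.strip())
# Segment each line with jieba and keep every token that is not a stop word.
for line in lines:
    words = jieba.lcut(line)
    for i in words:
        if i not in stop_words:
            count_word.append(i)
count_word_set = list(set(count_word))
counts = []
# The slow part: list.count() rescans all of count_word for every distinct word.
for i in count_word_set:
    counts.append(count_word.count(i))
print(len(count_word_set))
print(len(counts))
df = pd.DataFrame(data={'word': count_word_set, 'count': counts})
# sort_values returns a new DataFrame, so assign it back; most frequent words first.
df = df.sort_values(by='count', ascending=False)
# Write to a separate file (name chosen here) so the input data/count.txt is not overwritten.
df.to_csv('data/count_result.txt', sep='\t', encoding='utf-8')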

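However, after waiting all night there was still no result. He had said the data wasn't large, just a few million lines of text. What I did was deduplicate with set, loop over every word in the set, and call list.count(word) to get its frequency. Each count() call scans the entire list again, so the total work grows with (number of tokens) × (number of distinct words). At this scale the brute-force approach simply doesn't work.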
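To make that cost concrete, here is a rough sketch that estimates how many element comparisons the set + list.count() loop performs. The helper name estimated_comparisons and the toy word list are made up for illustration and are not part of the original script.

```python
def estimated_comparisons(words):
    """Rough number of element comparisons made by calling
    words.count(w) once for every distinct w in words."""
    return len(words) * len(set(words))


# Toy data, made up for illustration.
toy_words = ["a", "b", "a", "c", "b", "a"]

# 6 tokens * 3 distinct words
print(estimated_comparisons(toy_words))  # 18
```

With hypothetical figures of 10 million tokens and 100,000 distinct words (the actual numbers for this corpus were never measured), the same estimate comes out around 10^12 comparisons, which would be consistent with a run that does not finish overnight.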

So today I tried the Counter class.
import pickle
from collections import Counter

import jieba

count_word = []
count = Counter()
with open('data/count.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
with open('data/stop_words_new.txt', 'r', encoding='utf-8') as f:
    # A set makes each "not in stop_words" check a hash lookup instead of a list scan.
    stop_words = {word.strip() for word in f}
print('data finish')
# Segment each line with jieba and keep every token that is not a stop word.
for line in lines:
    words = jieba.lcut(line)
    for i in words:
        if i not in stop_words:
            count_word.append(i)
print('cut finish')
# One pass over the list: each word costs a single dictionary update.
for word in count_word:
    count[word] += 1
# most_common() returns (word, count) pairs sorted from most to least frequent.
cou = count.most_common()
print(cou)
with open('data/common.txt', 'wb') as f:
    pickle.dump(cou, f)

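It finished almost as soon as segmentation was done. If you only need the frequency of one particular word, the list's count() method is fine; if you need to count every word, use the Counter class.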
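As a small check of that advice, here is a sketch on made-up toy data (the word list and variable names are assumptions, not taken from the real corpus). It shows a single-word lookup with list.count(), a full tally built by passing the list straight to Counter, and that the slow set + list.count() approach gives the same numbers.

```python
from collections import Counter

# Toy word list, made up for illustration.
toy_words = ["apple", "pear", "apple", "plum", "pear", "apple"]

# One word: a single list.count() scan is perfectly fine.
apple_count = toy_words.count("apple")

# Every word: Counter tallies the whole list in one pass.
all_counts = Counter(toy_words)

# The set + list.count() approach yields the same counts,
# but needs one full scan of the list per distinct word.
slow_counts = {w: toy_words.count(w) for w in set(toy_words)}

print(apple_count)                        # 3
print(all_counts.most_common(2))          # [('apple', 3), ('pear', 2)]
print(slow_counts == dict(all_counts))    # True
```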

山风之行
