Python: list.count() and the Counter class

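Using the Counter class
A colleague handed me a pile of text data and asked for a small favor: he wanted the frequency of every word, to see what the text mentions most, before doing further analysis.

It's just word-frequency counting, even if it's not something I do often. Off the top of my head: segment the text, remove stop words, put all the words into one list, count them, done.

So I wrote the code in five minutes, and for the counting step I used the list's count method. At the risk of embarrassing myself, here is the code...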

import jieba
import pandas as pd

count_word = []
stop_words = []
with open('data/count.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
with open('data/stop_words_new.txt', 'r', encoding='utf-8') as f:
    stop_lines = f.readlines()
for word in stop_lines:
    stop_words.append(word.strip())
# Segment every line and keep the words that are not stop words.
for line in lines:
    words = jieba.lcut(line)
    for i in words:
        if i not in stop_words:
            count_word.append(i)
count_word_set = list(set(count_word))
counts = []
# The slow part: each list.count() call scans the whole list again.
for i in count_word_set:
    counts.append(count_word.count(i))
print(len(count_word_set))
print(len(counts))
df = pd.DataFrame(data={'word': count_word_set, 'count': counts})
# sort_values returns a new DataFrame, so the result has to be assigned.
df = df.sort_values(by='count')
# Note: this writes to the same path the input was read from.
df.to_csv('data/count.txt', sep='\t', encoding='utf-8')

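However, after waiting all night there was still no result. He had said the data set wasn't large, just a few million lines of text. I deduplicated the words with set, then looped over each word in the set and called list.count(word) to get its frequency. Every one of those calls scans the entire list, so the total work is roughly the number of distinct words times the number of tokens. The brute-force approach simply doesn't work at this scale.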
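To make that cost concrete, here is a rough back-of-the-envelope sketch. The function names and the corpus sizes are made up for illustration; they are not measurements from the data above.

```python
def list_count_cost(total_tokens, distinct_words):
    """Approximate element comparisons when list.count() is called once per distinct word."""
    # Each list.count() call scans the entire list, so the work multiplies.
    return total_tokens * distinct_words


def single_pass_cost(total_tokens):
    """Approximate dictionary updates when everything is counted in one pass."""
    return total_tokens


# Hypothetical corpus: 20 million tokens, 100 thousand distinct words.
total_tokens = 20_000_000
distinct_words = 100_000

print(list_count_cost(total_tokens, distinct_words))  # 2000000000000
print(single_pass_cost(total_tokens))                 # 20000000
```

Under these assumed sizes the repeated-scan approach does about a hundred thousand times more work than a single pass, which would be consistent with a job that never finishes overnight.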

So today I tried the Counter class.
import pickle
from collections import Counter

import jieba

count_word = []
count = Counter()
with open('data/count.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
with open('data/stop_words_new.txt', 'r', encoding='utf-8') as f:
    stop_lines = f.readlines()
print('data finish')
# A set keeps the stop-word membership test fast.
stop_words = {word.strip() for word in stop_lines}
for line in lines:
    words = jieba.lcut(line)
    for i in words:
        if i not in stop_words:
            count_word.append(i)
print('cut finish')
# One pass over the tokens; each step is a single dictionary increment.
for word in count_word:
    count[word] += 1
# With no argument, most_common() returns every (word, count) pair, highest count first.
cou = count.most_common()
print(cou)
with open('data/common.txt', 'wb') as f:
    pickle.dump(cou, f)

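The counting was done almost as soon as the segmentation finished. If you only want the frequency of one particular word, the list's count() method is fine; if you want the frequency of every word, use the Counter class.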
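The two situations can be sketched side by side on a tiny made-up sample (not the real data). As an optional variation, this sketch feeds the Counter one line at a time with update(), which avoids building one huge intermediate list. Here str.split() stands in for jieba.lcut() only so that the snippet runs without extra dependencies.

```python
from collections import Counter

# Made-up sample lines and stop words, for illustration only.
lines = ["the cat sat on the mat", "the dog sat", "a cat and a dog"]
stop_words = {"the", "a", "on", "and"}

counter = Counter()
for line in lines:
    # str.split() is a stand-in for jieba.lcut(line).
    counter.update(w for w in line.split() if w not in stop_words)

# Every word's frequency, most frequent first (top 2 shown here).
print(counter.most_common(2))

# One word only: list.count() is perfectly adequate.
all_tokens = [w for line in lines for w in line.split()]
print(all_tokens.count("cat"))  # 2
print(counter["cat"])           # 2
print(counter["bird"])          # 0, a missing key counts as zero
```

Looking up a single word in a Counter is a dictionary access, so once the Counter exists it also answers the one-word question without rescanning the list.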