lfsm committed on
Commit 3071277 · 1 Parent(s): d4d011f

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -4,7 +4,7 @@ this is ja cc filter for reference from ja wiki vs random ja mc4, and build with
  2. crawl 300K of 4M webpages from the urls
  3. get pure text and remove content len less than 1k,
  4. use langdetect to tell the lang of the pages,
- we finally get total **160K**pages : **101K** ja pages, **47K** en pages, and **12K** other lang pages
+ we finally get total 160K pages : 101K ja pages, 47K en pages, and 12K other lang pages
  5. random sample 16K from ja mc4, concat with all 16k pages to get lang_all.txt data
  6. random sample 10K from ja mc4, concat with ja 10k pages to get lang_ja.txt data
  7. tokenize all text with "cl-tohoku/bert-base-japanese"
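
For readers who want to reproduce steps 3 to 6 of the README excerpt above, the following is a minimal sketch, not the author's actual script. It assumes that "content len less than 1k" means fewer than 1,000 characters, and that the language check is a plain call to `langdetect.detect`. The function names (`filter_and_label`, `build_mixed_set`), the `MIN_CHARS` constant, and the `seed` parameter are hypothetical and introduced only for illustration.

```python
import random
from typing import Callable, Iterable, List, Tuple

# Assumption: "content len less than 1k" is read as fewer than 1,000 characters.
MIN_CHARS = 1000


def detect_with_langdetect(text: str) -> str:
    """Return a language code using the third-party langdetect package."""
    # Imported lazily so the rest of the sketch works without langdetect installed.
    from langdetect import detect  # pip install langdetect
    return detect(text)


def filter_and_label(
    pages: Iterable[str],
    detect: Callable[[str], str] = detect_with_langdetect,
    min_chars: int = MIN_CHARS,
) -> List[Tuple[str, str]]:
    """Steps 3-4: drop short pages, then tag each remaining page with a language code."""
    labeled = []
    for text in pages:
        if len(text) < min_chars:
            continue  # too short, skipped before language detection
        labeled.append((detect(text), text))
    return labeled


def build_mixed_set(
    crawled: List[str], mc4_docs: List[str], n_mc4: int, seed: int = 0
) -> List[str]:
    """Steps 5-6: sample n_mc4 documents from mC4 and concatenate them with crawled pages."""
    rng = random.Random(seed)  # hypothetical fixed seed, only for reproducibility
    return crawled + rng.sample(mc4_docs, n_mc4)
```

A possible way to wire these together: call `filter_and_label(pages)` on the extracted page texts, keep the entries whose label is `"ja"` for the `lang_ja.txt` variant (or all entries for `lang_all.txt`), and pass them to `build_mixed_set` together with documents drawn from Japanese mC4. Step 7 is left out because it requires downloading the "cl-tohoku/bert-base-japanese" tokenizer.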