Update README.md
README.md
CHANGED
@@ -4,7 +4,7 @@ this is ja cc filter for reference from ja wiki vs random ja mc4, and build with
 2. crawl 300K of 4M webpages from the urls
 3. get pure text and remove content len less than 1k,
 4. use langdetect to tell the lang of the pages,
-we finally get total
+we finally get total 160K pages : 101K ja pages, 47K en pages, and 12K other lang pages
 5. random sample 16K from ja mc4, concat with all 16k pages to get lang_all.txt data
 6. random sample 10K from ja mc4, concat with ja 10k pages to get lang_ja.txt data
 7. tokenize all text with "cl-tohoku/bert-base-japanese"
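Steps 3 to 6 of the README could be sketched roughly as follows. This is a minimal illustration, not the repository's actual script. The function names `bucket_by_language` and `mix_with_reference` are hypothetical. The sketch assumes that "1k" means 1,000 characters and that the mc4 sample is the same size as the crawled set. The language detector is passed in as a callable so the sketch does not depend on any particular library.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def bucket_by_language(
    pages: Iterable[str],
    detect: Callable[[str], str],
    min_len: int = 1000,  # assumption: "1k" in the README means 1,000 characters
) -> Dict[str, List[str]]:
    """Drop short pages (step 3) and group the rest by detected language (step 4)."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for text in pages:
        if len(text) < min_len:
            continue  # too short to keep
        try:
            lang = detect(text)
        except Exception:
            # A detector may fail on text with no usable features; bucket it separately.
            lang = "unknown"
        buckets[lang].append(text)
    return dict(buckets)


def mix_with_reference(
    crawled: List[str],
    reference: List[str],
    seed: int = 0,
) -> List[str]:
    """Sample as many reference documents as there are crawled pages,
    then concatenate the two sets (steps 5 and 6).

    Assumption: the reference sample size equals the crawled set size.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = min(len(crawled), len(reference))
    return rng.sample(reference, k) + crawled
```

With the `langdetect` package installed, `from langdetect import detect` provides a function that can be passed as the `detect` argument; it returns a language code such as `"ja"` or `"en"`. The Japanese bucket would then feed the `lang_ja.txt` mix, and all buckets together would feed `lang_all.txt`.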