Rolyer's Blog: Pinyin Syntax Check


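This program converts an input string into a sequence of pinyin syllables separated by spaces.

For example, the input "zhongguoren" produces the output "zhong guo ren".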

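The program also uses the Pinyin4j.jar library: at the start of processing, Chinese characters are first converted to pinyin. Polyphonic characters are not handled well (for example, 银行 --> yin xing, where the correct reading is yin hang), but that problem is set aside for now.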

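Put plainly, the program splits an input string of Latin letters according to the rules of pinyin. I ran into a few problems along the way and am recording them here. I did look for solutions online, but could not get a working implementation out of what I found.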

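Approach 1: read the pinyin dictionary (a, ai, an, ang, and so on; only 407 syllables in total) into memory. For each input string, take prefixes starting at length 1 and growing toward the end of the string, and binary-search each prefix in the pinyin dictionary.

If the prefix exists, append the next character and binary-search again;

if it does not exist, the previous prefix is taken as the longest matching syllable and saved as a result;

repeat until the end of the string.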

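After repeated testing I found bugs. Bug 1: syllables such as deng and xiong were not recognized; I added a check for a trailing ng, which fixed it.

Bug 2: the syllable tian is split into two syllables, ti an. The scan stops at the first prefix that is not in the dictionary, and tia is not a syllable, so it never gets as far as tian.

I felt that piling on more special cases would make the code ugly, so I abandoned this approach.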
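
To see why tian comes out as ti an, here is a minimal sketch of the same prefix-growing scan, reduced to its core. The class name PrefixGrowthDemo and the three-entry dictionary are made up for this illustration; it is not the author's code, and it leaves out the ng special case.

```java
import java.util.Arrays;

/** Toy reproduction of the prefix-growing scan (hypothetical class, demo dictionary). */
class PrefixGrowthDemo {
    // Made-up dictionary for the demo; must stay sorted for Arrays.binarySearch.
    static final String[] DICT = {"an", "ti", "tian"};

    static String split(String s) {
        StringBuilder out = new StringBuilder();
        boolean prevHit = false; // was the previous, shorter prefix in DICT?
        for (int begin = 0, end = 1; end <= s.length(); end++) {
            String prefix = s.substring(begin, end);
            if (Arrays.binarySearch(DICT, prefix) >= 0) {
                if (end == s.length()) {
                    out.append(prefix).append(' '); // hit at the very end: emit it
                }
                prevHit = true;
            } else if (prevHit) {
                // First miss after a hit: emit the previous prefix and restart there.
                end--;
                out.append(s, begin, end).append(' ');
                begin = end;
                prevHit = false;
            }
        }
        return out.toString().trim();
    }
}
```

With this toy dictionary, split("tian") returns "ti an": the scan emits ti as soon as tia misses, and never tests the longer prefix tian.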

Here pinyinDict is a String[]. The DictOper.readPYDict() method reads the file, converts each line to a String, and returns the resulting String array.

The pinyin dictionary file has the following format:

a

ai

an

ang

ao

ba

bai

...
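
As a rough sketch of what a loader for this format could look like: the post does not show DictOper, so the class name PinyinDictLoader, the Path parameter, and the UTF-8 encoding are all assumptions. It reads one syllable per line, skips blank lines, and sorts the array, since Arrays.binarySearch only gives meaningful results on sorted input.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for DictOper.readPYDict(); not the author's implementation. */
class PinyinDictLoader {
    static String[] readPYDict(Path file) throws IOException {
        // Assumes the dictionary file is UTF-8 with one syllable per line.
        List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
        String[] dict = lines.stream()
                .map(String::trim)
                .filter(line -> !line.isEmpty()) // tolerate blank lines
                .toArray(String[]::new);
        Arrays.sort(dict); // binary search requires sorted input
        return dict;
    }
}
```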

The code for the first approach is as follows:

 /**
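  * Binary-searches the pinyin dictionary to check whether a syllable exists.<br/>
  * Greedy algorithm, longest-match pinyin sequence. Bug: tiantian is split into ti an ti an
  * 
  * @param inputChar the characters to split
  * @return a space-separated pinyin string
  * @author chow 2010-8-25 10:58:22 AM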
  */
 @Deprecated
 public String _processString(char[] inputChar) {
  String temp = new String(inputChar);
  String[] strArray = temp.split(" ");
  StringBuffer result = new StringBuffer();
  if (pinyinDict == null || pinyinDict.length <= 0) {
   pinyinDict = DictOper.readPYDict();
  }
  for (int i = 0; i < strArray.length; i++) {
   String curStr = strArray[i];
   int curStrLen = curStr.length();
   boolean existWord = false;
   for (int beginIndex = 0, endIndex = 1; endIndex <= curStrLen; endIndex++) {
    String tmpKey = curStr.substring(beginIndex, endIndex);
    int index = Arrays.binarySearch(pinyinDict, tmpKey);
    // int gap = endIndex - beginIndex;
    if (index >= 0) { // found: keep going with the next character
     if (endIndex == curStrLen) {
      result.append(tmpKey + " ");
     } else if (endIndex + 2 <= curStrLen
       && curStr.substring(endIndex, endIndex + 2).equals(
         "ng")) {
      // if the next two letters are ng, attach them
      result.append(curStr
        .substring(beginIndex, endIndex + 2)
        + " ");
      beginIndex = endIndex + 2;
      endIndex = beginIndex;
      existWord = false;
      continue;
     }
     existWord = true;
     continue;
    }
    // not found, but the previous prefix was found
    else if (existWord) {
     result.append(curStr.substring(beginIndex, --endIndex));
     result.append(" ");
     beginIndex = endIndex;
     existWord = false;
     continue;
    }
   }
  }
  // System.out.println("result:" + result.toString());
  return result.toString();
 }

Approach 2:

Change the data structure of the pinyin dictionary to a Map<String, List<String>>, as follows:

a:[a, ai, an, ang, ao]

b:[ba, bai, ban, bang, bao, bei, ben, beng, bi, bian, biao, bie, bin, bing, bo, bu]

...
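
One possible way to derive this map from the flat syllable list, as a sketch only. The post does not show how DictOper.getPYData() builds it, so PinyinIndex and groupByFirstLetter are hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper; groups syllables under their first letter. */
class PinyinIndex {
    static Map<String, List<String>> groupByFirstLetter(String[] syllables) {
        // TreeMap keeps the keys in alphabetical order, which is handy when printing.
        Map<String, List<String>> index = new TreeMap<>();
        for (String syllable : syllables) {
            if (syllable.isEmpty()) {
                continue; // nothing to index
            }
            String firstLetter = syllable.substring(0, 1);
            index.computeIfAbsent(firstLetter, k -> new ArrayList<>()).add(syllable);
        }
        return index;
    }
}
```

Each list keeps the order of the input array, so a sorted dictionary gives sorted lists.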

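The idea is: starting from the first character of the string, if that character is a key in the dictionary, fetch its list of syllables, then take substrings of length 1 through 6 starting at that character.

For example, for the input tianxia the first character is 't', so the program fetches t:[ta, tai, tan, tang, tao, te, tei, teng, ti, tian, tiao, tie, ting, tong, tou, tu, tuan, tui, tun, tuo],

takes the six substrings t, ti, tia, tian, tianx, tianxi, and picks the longest one that matches, [tian]. It saves that result and continues from the character after the 'n' of 'tian', repeating until the end of the string.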

 /**
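  * Looks up the matching pinyin syllable in the Map.<br/>
  * Greedy algorithm, longest-match pinyin sequence
  * 
  * @param inputChar the characters to split
  * @return a space-separated pinyin string, e.g. zhong guo ren
  * @author chow 2010-8-25 11:00:07 AM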
  */
 public String _processStringByMap(char[] inputChar) {
  String temp = new String(inputChar);
  temp = PinyinHelper.toHanyuPinyinString(temp, outputFormat, "");
  String[] strArray = temp.split(" ");
  StringBuffer result = new StringBuffer();
  if (pyData == null) {
   pyData = DictOper.getPYData();
  }
  for (int i = 0; i < strArray.length; i++) {
   String curStr = strArray[i];
   int curStrLen = curStr.length();
   int beginIndex = 0, nextWordIndex = 0;
   while (beginIndex < curStrLen) {
    String firstLetter = curStr.substring(beginIndex,
      beginIndex + 1);
    List<String> list = pyData.get(firstLetter);
    if (list == null) {
     beginIndex += 1;
     nextWordIndex = beginIndex;
     continue;
    }
    for (int subLen = 1; subLen <= 6; subLen++) {
     if (beginIndex + subLen > curStrLen) {
      break;
     }
     String piece = curStr.substring(beginIndex, beginIndex
       + subLen);
     if (list.contains(piece)) {
      nextWordIndex = subLen + beginIndex;
     }
    }
    // if nothing matched, move both begin and next forward by one
    if (nextWordIndex == beginIndex) {
     beginIndex += 1;
     nextWordIndex = beginIndex;
     continue;
    }
    String subStr = curStr.substring(beginIndex, nextWordIndex);
    result.append(subStr + " ");
    beginIndex = nextWordIndex;
   }
  }
  if (result.length() == 0) {
   result.append(temp);
  }
  return result.toString();
 }

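Approach 2 recognizes pinyin strings well, but on reflection the program can still be optimized.

The range of lengths of the syllables that start with a given letter is known in advance. For example, the shortest syllable starting with 't' has length 2 (e.g. ta) and the longest has length 4 (e.g. tian).

So all that is needed is another change to the dictionary's data structure: write a small program that computes the shortest and longest syllable length for each initial letter. The updated pinyin dictionary becomes: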

a:[a, ai, an, ang, ao]

min:1,max:3

b:[ba, bai, ban, bang, bao, bei, ben, beng, bi, bian, biao, bie, bin, bing, bo, bu]

min:2,max:4

The min and max values can be stored in a Map<String,Integer[]> pyLenMap. Recall the inner loop from approach 2:

for (int subLen = 1; subLen <= 6; subLen++) {
     if (beginIndex + subLen > curStrLen) {
      break;
     }

The 1 and 6 in the for loop above can then be replaced with the min and max from pyLenMap.

That way the loop runs fewer iterations.
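
A minimal sketch of this optimization, under some assumptions: BoundedSegmenter, lengthBounds, and segment are hypothetical names, the bounds are stored as int[] {min, max} rather than Integer[], and the loop scans from the longest length downward so it can stop at the first hit, which is a small variation on the ascending loop in the post.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the min/max-bounded loop; not the author's code. */
class BoundedSegmenter {
    /** For each first letter, computes {shortest, longest} syllable length. */
    static Map<String, int[]> lengthBounds(Map<String, List<String>> pyData) {
        Map<String, int[]> bounds = new HashMap<>();
        for (Map.Entry<String, List<String>> entry : pyData.entrySet()) {
            int min = Integer.MAX_VALUE;
            int max = 0;
            for (String syllable : entry.getValue()) {
                min = Math.min(min, syllable.length());
                max = Math.max(max, syllable.length());
            }
            if (max > 0) {
                bounds.put(entry.getKey(), new int[] {min, max});
            }
        }
        return bounds;
    }

    /** Greedy longest match, trying only lengths between min and max. */
    static String segment(String input, Map<String, List<String>> pyData,
                          Map<String, int[]> bounds) {
        StringBuilder result = new StringBuilder();
        int begin = 0;
        while (begin < input.length()) {
            String firstLetter = input.substring(begin, begin + 1);
            List<String> candidates = pyData.get(firstLetter);
            int[] range = bounds.get(firstLetter);
            int next = begin;
            if (candidates != null && range != null) {
                int longest = Math.min(range[1], input.length() - begin);
                for (int len = longest; len >= range[0]; len--) {
                    if (candidates.contains(input.substring(begin, begin + len))) {
                        next = begin + len;
                        break; // longest match found, no need to try shorter ones
                    }
                }
            }
            if (next == begin) {
                begin++; // no syllable starts here: skip this character
                continue;
            }
            result.append(input, begin, next).append(' ');
            begin = next;
        }
        return result.toString().trim();
    }
}
```

For the 't' list shown earlier, lengthBounds yields {2, 4}, so at most three substrings are tested per position instead of six.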

The pinyin dictionary is also attached.

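Author: bosshida

Notice: this is an original article published on the JavaEye website. No website may reproduce it without the author's written permission; violations will be pursued under the law.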


Friday, September 3, 2010
