SMI (SAMI) subtitles are widely used in Korea.
For background on the format, see http://en.wikipedia.org/wiki/SAMI
The basic layout of a file is as follows:
<SAMI>
<HEAD>
<TITLE>SAMI Example</TITLE>
<SAMIParam>
  Media {cheap44.wav}
  Metrics {time:ms;}
  Spec {MSFT:1.0;}
</SAMIParam>
<STYLE TYPE="text/css">
<!--
  P { font-family: Arial; font-weight: normal; color: white; background-color: black; text-align: center; }
  #Source {color: red; background-color: blue; font-family: Courier; font-size: 12pt; font-weight: normal; text-align: left; }
  .ENUSCC { name: English; lang: en-US ; SAMIType: CC ; }
  .FRFRCC { name: French; lang: fr-FR ; SAMIType: CC ; }
-->
</STYLE>
</HEAD>
<BODY>
<-- Open play menu, choose Captions and Subtiles, On if available -->
<-- Open tools menu, Security, Show local captions when present -->
<SYNC Start=0>
  <P Class=ENUSCC ID=Source>The Speaker</P>
  <P Class=ENUSCC>SAMI 0000 text</P>
  <P Class=FRFRCC ID=Source>French The Speaker</P>
  <P Class=FRFRCC>French SAMI 0000 text</P>
</SYNC>
<SYNC Start=1000>
  <P Class=ENUSCC>SAMI 1000 text</P>
  <P Class=FRFRCC>French SAMI 1000 text</P>
</SYNC>
<SYNC Start=2000>
  <P Class=ENUSCC>SAMI 2000 text</P>
  <P Class=FRFRCC>French SAMI 2000 text</P>
</SYNC>
<SYNC Start=3000>
  <P Class=ENUSCC>SAMI 3000 text</P>
  <P Class=FRFRCC>French SAMI 3000 text</P>
</SYNC>
</BODY>
</SAMI>
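The sample declares one style class per language (.ENUSCC, .FRFRCC), and each <P> picks its language through Class=. As a rough illustration of how those declarations could be read, here is a minimal sketch. The class SamiLanguages and its method are hypothetical names made up for this example, and the regular expression assumes that each rule lists name: before lang:, as the sample does.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the original post): lists the language
// classes declared in a SAMI <STYLE> block.
final class SamiLanguages {
    // Matches rules such as: .ENUSCC { name: English; lang: en-US ; SAMIType: CC ; }
    // Assumption: "name:" comes before "lang:" inside the braces.
    private static final Pattern LANG_RULE = Pattern.compile(
            "\\.(\\w+)\\s*\\{\\s*name:\\s*([^;]+);\\s*lang:\\s*([^;\\s]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns class name -> language tag, in declaration order.
    static Map<String, String> find(String sami) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = LANG_RULE.matcher(sami);
        while (m.find()) {
            result.put(m.group(1), m.group(3)); // group(2) is the display name
        }
        return result;
    }
}
```

Calling SamiLanguages.find(...) on the sample text should give a map with two entries, ENUSCC and FRFRCC, which a player could use to offer a language choice.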
Because SAMI supports most HTML markup, I parse the subtitle text directly as HTML. The key line is: smiTemp = smiTemp + Html.fromHtml(data).toString();
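Html.fromHtml is only available on Android. To try the same idea on a plain JVM (for example in a unit test), a rough stand-in could look like the sketch below. SmiText and toPlainText are hypothetical names, not part of any SDK, and the helper handles only <br>, generic tags, and a handful of entities, so it is far less complete than the Android call.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Android SDK: a rough stand-in for
// Html.fromHtml(data).toString() that runs on a plain JVM.
final class SmiText {
    private static final Pattern BR =
            Pattern.compile("<br\\s*/?>", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String toPlainText(String html) {
        String text = BR.matcher(html).replaceAll("\n"); // keep line breaks
        text = TAG.matcher(text).replaceAll("");         // drop every other tag
        // Decode only a few common entities; &amp; goes last so that
        // "&amp;lt;" is decoded once, to "&lt;", and not twice.
        return text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
    }
}
```

For instance, SmiText.toPlainText("<font color=red>Hi</font><br>you &amp; me") yields "Hi", a line break, then "you & me".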
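A typical SMI subtitle file is around 200 KB to 300 KB, and parsing everything in it takes too long. So I parse only the start times and the subtitle text, and store them in two separate ArrayLists, as follows: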
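// Start time of each subtitle, in milliseconds
ArrayList<Integer> timeMills = new ArrayList<Integer>();
// Subtitle text
ArrayList<String> messages = new ArrayList<String>();
// Return value: holds timeMills first, then messages
ArrayList<ArrayList> arrays = new ArrayList<ArrayList>();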
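With the start times and the text kept in two parallel lists, showing the right subtitle during playback comes down to finding the last start time that is not later than the current position. A minimal sketch of that lookup follows. SubtitleLookup and textAt are hypothetical names, and the sketch assumes the start times are in ascending order with no duplicates and that index i of both lists describes the same subtitle.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical helper showing how the two parallel lists can be queried.
final class SubtitleLookup {
    // Returns the text of the last subtitle whose start time is <= positionMs,
    // or "" when playback has not reached the first subtitle yet.
    static String textAt(List<Integer> timeMills, List<String> messages,
                         int positionMs) {
        int idx = Collections.binarySearch(timeMills, positionMs);
        if (idx < 0) {
            // On a miss binarySearch returns -(insertionPoint) - 1;
            // the subtitle in effect is the one just before that point.
            idx = -idx - 2;
        }
        if (idx < 0 || idx >= messages.size()) {
            return "";
        }
        return messages.get(idx);
    }
}
```

A player could call textAt(timeMills, messages, currentPositionMs) on each position update. Binary search keeps the lookup cheap even for files with thousands of entries.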
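Regular expressions are used as well; see the code below. Even so, parsing a subtitle file of about 200 KB still takes around 15 seconds, and I have not found a better approach yet.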
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import android.text.Html;

public class SmiProcessor {
    // Start time of each subtitle, in milliseconds
    ArrayList<Integer> timeMills = new ArrayList<Integer>();
    // Subtitle text; messages.get(i) belongs to timeMills.get(i)
    ArrayList<String> messages = new ArrayList<String>();
    // Return value: holds timeMills first, then messages
    ArrayList<ArrayList> arrays = new ArrayList<ArrayList>();

    // A line that opens a new subtitle, e.g. <SYNC Start=1000><P Class=KRCC>
    Pattern patternstart = Pattern.compile(
            "<SYNC Start=(\\d+)><P Class=(.*?)>", Pattern.CASE_INSENSITIVE);
    // The same kind of line followed by a space, i.e. a blank subtitle
    Pattern spaceP = Pattern.compile(
            "<SYNC Start=\\d+><P Class=\\w+> ", Pattern.CASE_INSENSITIVE);

    public ArrayList<ArrayList> process(InputStream inputStream) {
        System.out.println("SmiProcessor");
        try {
            // The file is read as EUC-KR, the usual encoding of Korean subtitles
            InputStreamReader inputReader = new InputStreamReader(inputStream, "EUC-KR");
            BufferedReader br = new BufferedReader(inputReader);
            StringBuilder smiTemp = new StringBuilder();
            boolean started = false; // stays false while the header is skipped
            String data;
            while ((data = br.readLine()) != null) {
                Matcher matcher = patternstart.matcher(data);
                if (matcher.find()) {
                    // Store the text collected for the previous subtitle,
                    // even if it is empty, so that the two lists stay aligned
                    if (started) {
                        messages.add(smiTemp.toString());
                        smiTemp.setLength(0);
                    }
                    started = true;
                    // Read the start time
                    timeMills.add(Integer.parseInt(matcher.group(1)));
                    if (spaceP.matcher(data).find()) {
                        smiTemp.append(" ");
                    }
                } else if (started) {
                    smiTemp.append(Html.fromHtml(data).toString());
                }
            }
            // Store the text of the last subtitle
            if (started) {
                messages.add(smiTemp.toString());
            }
            br.close();
            arrays.add(timeMills);
            arrays.add(messages);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return arrays;
    }
}
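The class above calls Html.fromHtml once for every line of text. One thing that might be worth trying, and that has not been measured here, is to collect the raw lines of each subtitle in a StringBuilder and convert them only once per subtitle. The sketch below shows that shape in plain Java so it can be tried outside Android. SmiCueReader is a hypothetical name, and the HTML-to-text step is passed in as a function, so on Android it could wrap Html.fromHtml. Unlike the class above, this sketch also keeps any text that follows the <P> tag on the SYNC line itself.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: same line-based scan, one conversion per subtitle.
final class SmiCueReader {
    private static final Pattern SYNC = Pattern.compile(
            "<SYNC Start=(\\d+)><P Class=(.*?)>", Pattern.CASE_INSENSITIVE);

    final List<Integer> times = new ArrayList<>();
    final List<String> texts = new ArrayList<>();

    void read(Reader source, Function<String, String> htmlToText) {
        try (BufferedReader br = new BufferedReader(source)) {
            StringBuilder raw = new StringBuilder();
            boolean started = false; // false until the first SYNC line
            String line;
            while ((line = br.readLine()) != null) {
                Matcher m = SYNC.matcher(line);
                if (m.find()) {
                    if (started) { // flush the previous subtitle
                        texts.add(htmlToText.apply(raw.toString()));
                        raw.setLength(0);
                    }
                    started = true;
                    times.add(Integer.parseInt(m.group(1)));
                    raw.append(line, m.end(), line.length()); // text after <P ...>
                } else if (started) {
                    raw.append(line); // lines are joined without a separator
                }
            }
            if (started) { // flush the last subtitle
                texts.add(htmlToText.apply(raw.toString()));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

On Android the second argument could be something like s -> Html.fromHtml(s).toString(), provided the project's Java level allows lambdas. Whether this actually shortens the 15 seconds would have to be checked by timing both versions on a real file.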