Title: Acronym Finder
^^^^^^^^^^^^^^^^^^^^^

Team Detail:
Abhishek Agarwal (05329022)
Saurabh P. Singh (05329037)
Ramanand J. (05329402)

***************************

Abstract:
^^^^^^^^
In this project, we designed an Acronym Finder using a freely available CRF package[1]. Details about the package can be found at SourceForge. In this feature-based approach, we have tried to capture various positional and regular-expression-based features. We have completed the design and implementation of the tool, but we currently lack tagged datasets, so we cannot yet comment on its accuracy.

CRF package:
The package classifies data sequences based on various features. The training and testing code comes with the package. We were left with two major tasks:

1) To format the datasets, build an in-memory representation of the data sequences, and provide the interfaces which the package expects.

2) To code features specific to the Acronym Finder. The features coded so far are:

   a) Regular expression features, such as all letters capital, having parentheses, etc.

   b) Positional features: given an acronym with its expansion, e.g. Axxx Bxxx Cxxx Dxxx (ABCD), where the acronym is the i-th token and |i| is its length, the (i-j)-th token should begin with the (|i| - j + 1)-th letter of the acronym. Similarly, the feature checks the tokens on the other side of the acronym. Currently it does not handle stop words such as "and", "of", etc.

Formatting the datasets:
^^^^^^^^^^^^^^^^^^^^^^^
We downloaded around 100 pages from Wikipedia, from different fields, and converted them manually into XML files with the following tags: and . We then wrote an XML parser which parses the tagged files and stores the data in various objects, in accordance with the interface provided by the package.

Features:
^^^^^^^^^
Other than the common features provided by the author, some features specific to the Acronym Finder were designed.

Accuracy:
^^^^^^^^
Currently, we are facing a lack of tagged datasets: tagging requires manual input, so the process is very slow.
Therefore the results observed so far are not a true reflection of the accuracy. One confusion matrix we observed is as follows:

              ACTUAL OTHER    ACTUAL ACRONYM
OBS OTHER          100               0
OBS ACR             50              80

This shows high recall but low precision for the acronym class: all 80 actual acronyms were found, but 50 non-acronyms were also labelled as acronyms.

We were also able to find the reason for this. With so little training data, some weak features received weights that are too high. For example, "word begins with a capital letter" is not a reliable indicator of an acronym, but because the training data contained few negative examples for it, it gained a high weight.

We have discussed this with the author of the package, and we feel that if we increase the dataset to a reasonable size, we can achieve good accuracy.

******************************************************************************
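The positional feature described under "CRF package" can be illustrated with a small sketch. This is not the code from our tool or from the CRF package. The function name expansion_matches is hypothetical, the tokens are assumed to be already split so that the acronym is a token of its own, and only the tokens to the left of the acronym are checked. As in the description, stop words are not handled.

```python
def expansion_matches(tokens, i):
    """Return True if the tokens just before position i spell out the acronym at i.

    Hypothetical helper: for j = 1..|i|, token i-j must begin with the
    (|i| - j + 1)-th letter of the acronym (1-based), ignoring case.
    """
    acronym = tokens[i].strip("()")
    n = len(acronym)
    if n == 0 or i < n:
        # Not enough tokens to the left to cover every letter.
        return False
    for j in range(1, n + 1):
        expected = acronym[n - j]  # 0-based index of the (n - j + 1)-th letter
        if not tokens[i - j].lower().startswith(expected.lower()):
            return False
    return True


tokens = ["Axxx", "Bxxx", "Cxxx", "Dxxx", "(ABCD)"]
matched = expansion_matches(tokens, 4)
print(matched)
```

In a CRF setting, a check like this would be exposed as a binary feature on the acronym token rather than used as a hard rule, so that the model can learn how much weight to give it.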
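The precision and recall implied by the confusion matrix above can be worked out with a short sketch. The counts are taken from the matrix in the Accuracy section. The dictionary layout and the function name precision_recall are assumptions made for this illustration and are not part of the CRF package.

```python
# Confusion matrix from the report, indexed as counts[observed][actual].
counts = {
    "OTHER": {"OTHER": 100, "ACRONYM": 0},
    "ACRONYM": {"OTHER": 50, "ACRONYM": 80},
}


def precision_recall(counts, label):
    """Compute precision and recall for one label (hypothetical helper)."""
    tp = counts[label][label]
    # Observed as `label` but actually something else.
    fp = sum(n for actual, n in counts[label].items() if actual != label)
    # Actually `label` but observed as something else.
    fn = sum(row[label] for observed, row in counts.items() if observed != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


precision, recall = precision_recall(counts, "ACRONYM")
print(f"precision={precision:.3f} recall={recall:.3f}")
# prints: precision=0.615 recall=1.000
```

For the acronym class this gives a recall of 80/80 and a precision of 80/130, which matches the observation of high recall and low precision.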