Ideas List

A major goal of this project is to jumpstart a true uScript open source community, taking the project outside of its WPI birthplace and into the wider world. We plan to provide future project participants with an ongoing, comprehensive view of the status of the project, where we envision the initiative progressing from here, and how we imagine that might be done.

Those wishing to gain some more insight into uScript should visit the following URLs:

0. Project Management and Structure

The project cannot really take off until it is fully organized as a real open source initiative, with all of the infrastructure that makes such enterprises flourish. The primary goal of this first Summer of Code session is to inaugurate the system as a true open source project, open to the world and workable by those who want to continue the code development, as well as usable by those who want to begin using uScript for the development of digital transcriptions of ancient manuscripts.

0.1 Revamped Front End CMS

The open-source project itself should have a well thought-out front page, starting from the existing site organized as a content management system (CMS) located at http://uscript.veniceprojectcenter.org.

The current SourceForge repository and site also need to be made more generally appealing, along the lines of other major open source efforts.

0.2 Review, Organization and Validation of Code Base

The Subversion code repository should be reviewed and re-organized for release in order to facilitate open source development. Module validation procedures should be developed and applied to existing code to prepare for core release.

0.3 Developer Blog & Forums

This idea list should itself become a "live" document to expand its reach and availability to the open source community. In parallel, a uScript blog and one or more forums should be created to allow developer interactions and to foster the birth of a real uScript community.

0.4 Bug Reporting and Support System (tickets, etc.)

To prepare for core release, we will need to put in place a bug-reporting/ticketing system with alerts and follow-ups, so that users can receive a reasonable response to issues that will emerge once code is released.

0.5 Release of Core

The main goal of this year's Summer of Code is to release a core system that is fully operational, albeit limited in functionality, and is supported by a full-fledged open source infrastructure ready to leverage developers around the world and ready to support pioneering users with adequate resources and response times.

1. Transcription Assistant Enhancements

Depending on the level of interest in the project, we will also continue to develop the core modules, starting with the Transcription Assistant, which may well represent the bulk of the initial core release. Starting with the existing application, the following enhancements could be made:

1.1 Adding a Throbber/Loading Bar

In the current implementation of the Transcription Assistant, there is no way to monitor the status of any ongoing loading or processing. As a result, users waiting for the program to load a large element or finish an intensive task are left without any indication that the application is still actually functioning. Browsers address this with elements like loading bars and throbbers. Throbbers are icons, typically located in the top-right corner of a browser window, which run a simple animation while the browser is busy. This is a simple method for reassuring the user that all is well and preventing them from mistakenly closing the application. While our current state of implementation does not necessitate a throbber or a loading bar, almost any extension of functionality would be greatly aided by the presence of such a device.

1.2 Browser History Support

One of the Google Web Toolkit's (GWT) commonly touted features is its browser history support. Because browsers record history entries when pages load, JavaScript applications that update a single page in place often do not or cannot make use of the browser history. As a result, when users move between views in such applications, they cannot rely upon the browser's built-in navigation tools, such as the back and forward buttons. GWT allows an application to create new history items that are placed on the browser's history stack. While this is not a core feature, it does not appear prohibitively difficult to implement relative to its usability benefits.

1.3 User Projects

Realistically, users are not going to look through only one or two documents for the whole of their research. Instead, they are going to collect lists of documents that they reference, and it would be very useful for the Transcription Assistant to facilitate this. The idea of a project, then, would be to maintain a collection of links to relevant files. Since the project file is simply a list of references, it could easily be exported to and imported from local files, and could easily be shared among multiple users.
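
As a rough illustration of the idea, a project can be modeled as a named list of document references that round-trips through a text file. The sketch below is only an assumption about how this might look: the `Project` class, its field names, the reference strings, and the choice of JSON as the file format are all hypothetical and are not taken from the existing code base.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Project:
    """A named collection of references to documents (hypothetical structure)."""
    name: str
    document_refs: list = field(default_factory=list)

    def add(self, ref: str) -> None:
        # Store each reference only once, preserving insertion order.
        if ref not in self.document_refs:
            self.document_refs.append(ref)

    def to_json(self) -> str:
        # Serialize the project so it can be saved to a local file or shared.
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text: str) -> "Project":
        # Rebuild a project from previously exported JSON text.
        data = json.loads(text)
        return cls(name=data["name"], document_refs=list(data["document_refs"]))
```

Because the exported file holds only references and not the documents themselves, it stays small and can be passed between users without copying any archive material.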

1.4 Splitting/Merging Boxes

There is a problematic case in emergent transcription where there is disagreement over whether a section of text is one or multiple words. Worse still, such contested sections may overlap and be interwoven. Our proposed solution would be to create tools for splitting and merging boxes. These splits and merges would themselves need to be suggestions, and as such, would be stored separately from the individual suggestions. Essentially a section that was originally marked as two words could be merged together, creating a third suggestion-storing object for when the boxes were viewed as one. The user would then need some visual indication in order to switch between the two views of the wording. Splitting would work similarly in the other direction.
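
One way to picture the merge case is sketched below, assuming that boxes are axis-aligned rectangles and that a merge is stored as its own object pointing back at the boxes it combines. The names `Box`, `MergeSuggestion`, and `propose_merge` are hypothetical and are used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    """A rectangular word box with its own list of suggested transcriptions."""
    box_id: int
    left: int
    top: int
    right: int
    bottom: int
    suggestions: list = field(default_factory=list)


@dataclass
class MergeSuggestion:
    """A proposal that several boxes are really one word.

    The original boxes and their suggestions are left untouched; the merged
    box collects suggestions for the combined reading separately.
    """
    source_ids: tuple
    merged: Box


def propose_merge(boxes, new_id):
    # The merged box is the bounding rectangle of all the source boxes.
    merged = Box(
        box_id=new_id,
        left=min(b.left for b in boxes),
        top=min(b.top for b in boxes),
        right=max(b.right for b in boxes),
        bottom=max(b.bottom for b in boxes),
    )
    return MergeSuggestion(source_ids=tuple(b.box_id for b in boxes), merged=merged)
```

Keeping the merged box separate means both readings remain available, so the interface can switch between the one-word and two-word views without losing any suggestions. A split suggestion could mirror this structure, with one source box and several resulting boxes.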

1.5 Expanding Support for Other Languages and Other Archives

Currently, our project only has a language entry present for English, and has no specialization for different archives or formats. Since we would like to see this project implemented in various countries by different archives, it will be important for future groups to spend time working on implementing new language support and making metadata friendly to differing formats. The current application already contains the groundwork for implementing internationalization. Also, thankfully, most archives primarily use an item reference number that contains all of the information they need, but the accommodation of changes such as specialized forms must nevertheless be anticipated.

1.6 Straight-to-Text

A straight-to-text tool would be fairly simple to write, and is badly needed in the next release of the software. This tool would take the top-ranked suggestion for each word (or the word that the current user has chosen) and output the words in sequence (as defined by the user with the help of the auto-ordering algorithm). The output could go to an output file to be saved locally, or to a secondary “Preview” tab.
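
The behavior described above can be sketched in a few lines. The data layout here is an assumption made for illustration: each word is a dictionary with a hypothetical `id`, an `order` position (as produced by the user and the auto-ordering step), and a list of `(text, score)` suggestions. The `[?]` placeholder for words with no suggestion is also an invented convention.

```python
def straight_to_text(words, user_choices=None):
    """Join the chosen or top-ranked suggestion for each word, in reading order.

    words: list of dicts with keys 'id', 'order', and 'suggestions',
           where 'suggestions' is a list of (text, score) pairs.
    user_choices: optional dict mapping a word id to the text the user picked.
    """
    user_choices = user_choices or {}
    output = []
    for word in sorted(words, key=lambda w: w["order"]):
        if word["id"] in user_choices:
            # The current user's explicit choice overrides the ranking.
            output.append(user_choices[word["id"]])
        elif word["suggestions"]:
            # Otherwise fall back to the highest-scoring suggestion.
            best_text, _ = max(word["suggestions"], key=lambda s: s[1])
            output.append(best_text)
        else:
            # No transcription has been suggested yet for this word.
            output.append("[?]")
    return " ".join(output)
```

The returned string could then be written to a local file or shown in the “Preview” tab.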

2. User Experience and Management

Once the core Transcription Assistant is released, we will need to begin managing users, so we should plan to release tools to support them in parallel with the initial core release.

2.1 User Forums

Another useful addition to the uScript system that future work should implement would be a set of user forums. These would allow users to interact in ways that go beyond assisting one another in transcribing manuscripts by offering suggested transcriptions and giving a thumbs-up-or-down vote on existing suggestions. Users would be able to have more open discussions with one another about transcribing and the uScript system in general. Along with providing a much stronger sense of community and cooperation, user forums could serve as a mechanism with which to collect feedback about the usability of the system and suggestions for improvement.

2.2 User Management

Another feature that would be nice to see in future iterations of the Transcription Assistant is user management. The possibility of malicious users working within the system should be recognized as a potential problem and some protection scheme put in place to combat behavior that is clearly subversive. One simple way this could be accomplished is by allowing users to report other users who they suspect are causing trouble. Reputation and credibility management also present an opportunity both to improve the usability of the Transcription Assistant and help offset the efforts of malicious users. While a simple reputation system exists in the current system, a future team should take the time to devise a more sophisticated scheme.

2.3 User Pages

Many well-established, successful websites have user pages that act as a sort of customized headquarters to make browsing that site easier for each user. A user homepage on the uScript website should list the transcriptions that user has worked on recently, the user's projects (see 1.3 above), transcriptions that are similar in content to manuscripts the user has worked on, and similar transcriptions listed by era or location.

3. Overall System Improvements

Together with specific functions, more "academic" and "research" issues will also need to be addressed to improve the system's performance.

3.1 Metadata Management and Searching

The current uScript system provides the framework for taking a set of metadata obtained from (e.g.) the Venice State Archive and allowing users to search for existing transcriptions with those parameters. Ideally, users will be able to seek out manuscripts/transcriptions on a given topic by searching the body of text in a transcription in addition to manuscript metadata. This will increase the amount of related material users are able to discover to further their research. We should explore adopting DSpace to manage this aspect of the project, as well as the addition of a metadata and content (i.e., transcribed text) search feature.
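
To make the combined search concrete, here is a minimal sketch of matching a query against both metadata and transcribed text. It is a plain in-memory scan for illustration only, not a description of DSpace or of the existing uScript search; the record layout (`id`, `metadata`, `text`) is hypothetical.

```python
def search(records, query):
    """Return ids of records whose metadata or transcribed text contains every query term.

    records: list of dicts with keys 'id', 'metadata' (a dict), and 'text'.
    """
    terms = query.lower().split()
    hits = []
    for record in records:
        # Search metadata values and the transcription body as one string.
        metadata_text = " ".join(str(value) for value in record["metadata"].values())
        haystack = (metadata_text + " " + record["text"]).lower()
        if all(term in haystack for term in terms):
            hits.append(record["id"])
    return hits
```

A production system would more likely rely on an indexed search engine than on a linear scan, but the sketch shows why searching the transcription body widens the set of discoverable material: a query term can match even when it appears in no metadata field.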

3.2 Enhanced Optical Character Recognition (OCR)

Enhanced optical character recognition has long been a goal of the initiative. As it currently stands, the image is analyzed in order to determine where individual words lie within the page ("auto-boxing"). This is commonly the first step taken in modern OCR algorithms. We would like to see this taken a step further, into analyzing the content within the boxes. (As a side note, the success of this step would depend on reliable auto-boxing of the image. Please see section 3.5, Algorithm Optimization, for more on this.)

The method for analysis is yet to be researched, but some ideas have been proposed. Typical methods are unlikely to succeed for this project, as manuscript text is so intricate and variable. It would not be appropriate to compare the word images to any standard set (or sets) of data in order to discover individual letters and proceed to build words. A more feasible option might be to utilize previously transcribed words, paired with some sort of quantitative analysis of the digital image. If, once the words on a page were boxed, the image data within each box could be extracted and stored, individual words could be compared to each other. If patterns matched, the system could use the user-recommended transcriptions for one word to make suggestions for the other, unknown word.

How the software could analyze, quantify, and compare each word image is an interesting algorithmic problem. Comparing one image to another pixel by pixel would be extremely computationally intensive, and it would also be very difficult to determine whether the result was a valid “match”. One possible solution that has been raised is to quantify the pen strokes within a word image to make for easier pattern matching. Other possible optimizations of this algorithm might take into account the metadata for the manuscript that a word is taken from, such as time period, author, or language. Clearly, a match would be more likely between word images that are known to be from the same author.
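
The general approach of reducing each word image to a small set of numbers before comparing can be illustrated as follows. Note that this sketch does not quantify pen strokes; it substitutes a much simpler feature, a vertical projection profile (the fraction of ink in each vertical slice of the box), purely to show the shape of the idea. The function names, the binary image format (lists of rows of 0/1 values), and the threshold value are all assumptions.

```python
def projection_profile(image, bins=8):
    """Fraction of ink pixels in each of `bins` vertical slices of a binary image."""
    height = len(image)
    width = len(image[0])
    profile = []
    for i in range(bins):
        start = i * width // bins
        # Guarantee that every slice covers at least one pixel column.
        end = min(width, max(start + 1, (i + 1) * width // bins))
        ink = sum(row[x] for row in image for x in range(start, end))
        profile.append(ink / (height * (end - start)))
    return profile


def profile_distance(a, b):
    """Mean absolute difference between two equal-length profiles (0 means identical)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def suggest_from_known(unknown, known, threshold=0.1):
    """Return transcriptions of known word images that resemble the unknown one.

    known: list of (image, transcription) pairs; closest matches come first.
    """
    target = projection_profile(unknown)
    matches = []
    for image, transcription in known:
        distance = profile_distance(target, projection_profile(image))
        if distance <= threshold:
            matches.append((distance, transcription))
    return [transcription for _, transcription in sorted(matches)]
```

Because each image is reduced to a fixed-length profile, comparing two words costs the same regardless of image size, which avoids the pixel-by-pixel cost mentioned above. Whether such a coarse feature would discriminate well enough on real manuscript hands is an open question; restricting the `known` list to words from the same author or period would be one way to apply the metadata idea.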

3.3 Importing Options

Expanding options for importing a manuscript into the system through the Archive Assistant is a necessary improvement. When importing a manuscript, the archivist should be required to enter a specific set of metadata about the manuscript. A preliminary boxing of each page of the manuscript should also be required. There should be some options presented to the archivist about how the documents should be boxed. Some options might include manually boxing the page(s), manually setting threshold values for the auto-boxing algorithm, displaying some sample manuscript images with preset threshold values for similar manuscripts, or the option to manually box a single page of a multi-page manuscript in order to have the genetic algorithm find optimal threshold values for the rest of the pages.

3.4 Submission Incentives

One of the long-term goals stated early on in the initiative was the desire for some sort of incentive program for system users. This could include but is not limited to monetary credit for providing transcriptions, to be funded by yearly service subscription fees. This could also be enhanced by giving more credit to users with higher transcription reputations, and by charging users for access to advanced features such as an optical character recognition algorithm or to view a transcription. None of these incentives should interfere with uScript's free and open-source nature.

3.5 Algorithm Optimization

Most of the server-side processes have somewhat sophisticated algorithms, and most of them can and should be optimized to run more efficiently. The auto-boxing algorithm uses a smearing algorithm that may not be the most accurate choice. The auto-lining algorithm could be written to run more efficiently. The genetic algorithm was written to run indefinitely, or until halted by a user. Realistically, it needs to have some way to decide when it has found a satisfactory solution. The inner workings of the genetic algorithm and auto-boxing are discussed in depth in previous reports, and the current auto-ordering algorithm is explained within the code.
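
One common way to give an evolutionary search a stopping rule is to halt when the best fitness has not improved for a fixed number of generations, with a hard cap as a safety net. The sketch below shows that rule on a deliberately simplified loop (selection plus mutation, no crossover). It is not the project's genetic algorithm; the function name and the `patience` and `max_generations` parameters are assumptions.

```python
import random


def run_until_stagnant(initial, fitness, mutate, patience=20, max_generations=1000, rng=None):
    """Evolve a population until the best fitness stops improving.

    Stops after `patience` consecutive generations without improvement,
    or after `max_generations` in total. Returns (best individual, generations run).
    """
    rng = rng or random.Random(0)
    population = list(initial)
    best = max(population, key=fitness)
    stale = 0
    generations = 0
    while generations < max_generations and stale < patience:
        generations += 1
        # Keep the fitter half and refill the population with mutated copies.
        population.sort(key=fitness, reverse=True)
        survivors = population[: max(1, len(population) // 2)]
        children = [mutate(rng.choice(survivors), rng)
                    for _ in range(len(population) - len(survivors))]
        population = survivors + children
        candidate = max(population, key=fitness)
        if fitness(candidate) > fitness(best):
            best, stale = candidate, 0  # Improvement found: reset the counter.
        else:
            stale += 1
    return best, generations
```

A larger `patience` trades running time for a better chance of escaping a plateau. An absolute fitness target could be added as a third stopping condition if a "good enough" boxing score can be defined.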
