Project sites:
http://www.cs.washington.edu/education/courses/cse490f/07wi/project_files/camera/
http://panlingual.org/ (hosted by Utilika Foundation)
Report URLs:
http://www.cs.washington.edu/education/courses/cse490f/07wi/project_files/camera/rereproto/
http://panlingual.org/rereproto/
We report here an increment in the development and testing of a user interface for a panlingual camera phone. In this increment, we changed the deployment platform from a low-fidelity Web emulation to a mobile telephone. While doing so, we also modified the interface's logic in some ways on the basis of results of our Web testing.
The device-based prototype has functionalities that permit users to perform the following representative tasks with it:
"Understanding something around you" (easy): While in an unfamiliar place, you see some written and printed texts around you in a language you cannot read. Get a translation of one of these texts into English.
"Making yourself understood" (moderate): You have met somebody with whom you have shared interests but no shared language. The person knows Hattanese, but you do not. You have something to say to the person. Write it in English, get it translated into Hattanese, and show the translation to the person.
"Understanding a scene you saw earlier" (difficult): You have made several photographs of scenes around you while visiting unfamiliar places. Later on your trip, you want to understand better the scenes that you saw. Choose one of the photographs in your collection and get a translation of the text in that photograph into English. If you can, check whether the text in the photograph was correctly read. If you see obvious errors, correct them before the translation proceeds.
These tasks are similar to those performed by users in tests of previous versions of the prototype, but they are more advanced in some ways and less advanced in others than the predecessor tasks, because of new capabilities and new limitations of the prototyping tools used. In particular, the new tasks permit the user to make real choices of scenes to photograph, to enter text directly instead of writing it on paper and photographing it, to check and correct the output from the text recognizer before that output is translated, and to use existing images as translation input rather than only new photographs. Conversely, the new tasks do not rely on the previously implemented capabilities of zooming in on text regions and obtaining word-based metadata.
The Web-based user testing gave rise to recommendations for changes or re-evaluations with respect to five aspects of the interface design, and we made design modifications in these aspects.
1. We added some content to the help subsystem. The help button (represented in the prototype with "??") displays a general orientation on the functions available in the current state. Each control, when pressed and held, displays a detailed description of what it does. Figure 1 shows two examples of the control help.
Figure 1. Modifications in Help Subsystem (panels: Before, After).
2. We added an option to insert inspection, cropping, correction, and approval steps between the capture of a photograph and its translation. This option permits the user to check a photograph for legibility, correct errors in the automatic recognition of text in the photograph, and limit the part to be translated. To make this change, we separated the "Capture & translate" button into two buttons, "Capture" and "Translate", as shown in Figure 2. Our assumption is that users will understand or quickly learn that "Translate" includes the capture of the currently viewed image. Users who choose "Capture" then get a view of the raw image, with no text processing. Thus, the opportunity to limit what is submitted for translation is spatial rather than textual. In this version of the design, the user may draw any shape, and the zoomed-in space will be the shape's minimum bounding rectangle.
Figure 2. Disaggregation of Capture and Translate Functions (panels: Before, After).
Figure 3 shows the sequence of four states that occurs when the user captures a photograph and limits the area to be translated.
Figure 3. Disaggregation of Capture and Translate Functions (panels: Camera start state, Raw-image state, Spatial-zoom start state, Spatial-zoomed-in state).
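The spatial zoom described above reduces a freehand shape to its minimum bounding rectangle. The following Python sketch illustrates that reduction only; it is not the prototype's code. The names `Rect`, `minimum_bounding_rectangle`, and `clamp_to_display` are hypothetical, and the stroke is assumed to arrive as a list of (x, y) pixel coordinates.

```python
from typing import Iterable, NamedTuple, Tuple

Point = Tuple[int, int]


class Rect(NamedTuple):
    """Axis-aligned rectangle in display pixels (hypothetical helper type)."""
    left: int
    top: int
    right: int
    bottom: int


def minimum_bounding_rectangle(stroke: Iterable[Point]) -> Rect:
    """Return the smallest axis-aligned rectangle containing every stroke point."""
    points = list(stroke)
    if not points:
        raise ValueError("stroke must contain at least one point")
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return Rect(min(xs), min(ys), max(xs), max(ys))


def clamp_to_display(rect: Rect, width: int, height: int) -> Rect:
    """Trim a rectangle so that it lies within a width x height display."""
    return Rect(
        max(0, rect.left),
        max(0, rect.top),
        min(width - 1, rect.right),
        min(height - 1, rect.bottom),
    )


# A rough loop drawn around a sign in the raw image (illustrative coordinates).
stroke = [(50, 40), (120, 35), (130, 90), (60, 100)]
zoom_region = minimum_bounding_rectangle(stroke)
```

Because only the extreme coordinates matter, the user's shape need not be closed or tidy; any scribble covering the text of interest yields a usable zoom region.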
When the user requests a translation, the revised design offers two options. Single-clicking the "Translate" button makes the application identify the image text's language and recognize the text. The user can correct the system's guess of the source language. The recognized text is user-editable, and the user can therefore correct recognition errors. Clicking the "Translate" button again produces a translation from the recognized text. The recognized text moves to the top of the screen, and the translation (with its language label) appears below. If the user initially double-clicks the "Translate" button, both operations take place in immediate succession, so the user indicates trust in the identification of the source language and the recognition of the text.
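The two paths through the "Translate" button can be sketched as a small state holder. This is a minimal Python illustration under stated assumptions, not the design's actual logic: the class `TranslateFlow`, its method names, and the `recognize` and `translate` callables are all hypothetical stand-ins for the recognition and translation services.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class TranslateFlow:
    """Tracks what the "Translate" button should do next (hypothetical model)."""
    recognize: Callable[[str], Tuple[str, str]]  # image id -> (language, text)
    translate: Callable[[str, str], str]         # (language, text) -> translation
    source_language: Optional[str] = None
    recognized_text: Optional[str] = None
    translation: Optional[str] = None

    def single_click(self, image_id: str) -> None:
        """First click recognizes the text; the next click translates it."""
        if self.recognized_text is None:
            self.source_language, self.recognized_text = self.recognize(image_id)
        else:
            self.translation = self.translate(self.source_language, self.recognized_text)

    def double_click(self, image_id: str) -> None:
        """Recognize (if not yet done) and translate with no review step."""
        if self.recognized_text is None:
            self.source_language, self.recognized_text = self.recognize(image_id)
        self.translation = self.translate(self.source_language, self.recognized_text)

    def correct_language(self, language: str) -> None:
        """User overrides the system's guess of the source language."""
        self.source_language = language

    def edit_text(self, text: str) -> None:
        """User fixes recognition errors before translation."""
        self.recognized_text = text
```

In this sketch a cautious user would call `single_click`, then `edit_text` or `correct_language` as needed, then `single_click` again; a trusting user would call `double_click` once.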
3. We gave the user more flexibility in how to specify a text region to zoom in on. Previously, the only method was to touch the first word, then the last word, of the region. The revised interface also permits the user to enclose the region by drawing a line around it, and to mark the first and last words by drawing a line from one to the other. The instruction has become general enough to cover all three methods, and the user gets a description of the methods when asking for help on that instruction. The second and third methods decrease the zoom-in operation from a three-step to a two-step one. Figure 4 shows the changed initial instruction. If the user touches a single word in response to that instruction, then the instruction changes to ask the user to touch the last word, as before.
Figure 4. Addition of All-in-One Zoom-In Specification Methods (panels: Before, After).
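One way the interface might tell the three region-specification methods apart is by the shape of the stylus stroke. The Python sketch below is an assumption-laden illustration, not the method used in the design: the function names and both thresholds (`tap_radius`, `closure_ratio`) are hypothetical and would need tuning on a real device.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def path_length(stroke: List[Point]) -> float:
    """Total distance travelled along the stroke."""
    return sum(math.dist(a, b) for a, b in zip(stroke, stroke[1:]))


def classify_gesture(stroke: List[Point],
                     tap_radius: float = 8.0,
                     closure_ratio: float = 0.25) -> str:
    """Classify a stroke as 'tap', 'enclosure', or 'line'.

    tap:       the stylus barely moved, so the user touched a single word
               and should next be asked to touch the last word.
    enclosure: the stroke ends near where it began, so it encloses a region.
    line:      the stroke runs from the first word to the last word.
    """
    if not stroke:
        raise ValueError("stroke must contain at least one point")
    length = path_length(stroke)
    if length <= tap_radius:
        return "tap"
    gap = math.dist(stroke[0], stroke[-1])
    if gap <= closure_ratio * length:
        return "enclosure"
    return "line"
```

A tap leaves the three-step sequence in place, while the other two outcomes complete the region specification in a single gesture.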
4. We added a line linking the source and target words to the annotations shown when the user touches a word of either text. This treatment makes use of the gestalt principle of uniform connectedness. Figure 5 shows this addition.
Figure 5. Addition of Connecting Line to Lexical Metadata (panels: Before, After).

Figure 6. Launch State.
5. We provided additional methods for acquiring source texts for translations. The previous design required the user to take a photograph and use it immediately as the source. The revised design adds three more methods. (1) The user may use an existing image. (2) The user may use an existing text. (3) The user may enter new text by any method supported by the device, such as typing, handwriting, or recognizable speech. Existing images and texts may be found in documents stored in the device or retrieved from other sources, such as Web pages.
Extending the design to permit multiple input sources, we have created a new launch state, which precedes the standby state. The launch state, shown in Figure 6, identifies the application and asks the user to select the input source. The user may select the camera as the input source, moving the application into the camera start state shown above in Figure 3. The user may select an existing image, making the application invoke the device's image-library management utility and then use the user-selected image in the raw-image state shown in Figure 3. If the user selects an existing text or a text-entry method, the application gives the user access to the device's functionalities that return either whole text documents or text passages. On the assumption that the user selects all and only the text that the user wants translated, the application performs a translation and proceeds to the existing result state, except that the source text is displayed as printed text rather than as an image.
The "Exit" button in all existing states no longer quits the application after saving any unsaved source; it now invokes the launch state, where the user may select an input source of any type or exit the application. The "New photo" button remains unchanged, on the assumption that a new photograph will remain the most frequently preferred input type. This assumption will be examined in future testing.
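The transitions into and out of the launch state can be summarized as a lookup table. The Python sketch below is illustrative only: the state and action names are hypothetical labels for the states described above, and intermediate steps (such as the device's image-library utility) are modeled as single states.

```python
from enum import Enum


class State(Enum):
    LAUNCH = "launch"
    CAMERA_START = "camera start"
    IMAGE_LIBRARY = "image library"   # device's image-library utility
    RAW_IMAGE = "raw image"
    TEXT_SOURCE = "text source"       # device's text selection or entry
    RESULT = "result"
    CLOSED = "closed"


# (current state, user action) -> next state
TRANSITIONS = {
    (State.LAUNCH, "camera"): State.CAMERA_START,
    (State.LAUNCH, "existing image"): State.IMAGE_LIBRARY,
    (State.IMAGE_LIBRARY, "image selected"): State.RAW_IMAGE,
    (State.LAUNCH, "text"): State.TEXT_SOURCE,
    (State.TEXT_SOURCE, "text selected"): State.RESULT,
}


def next_state(state: State, action: str) -> State:
    """Apply one user action and return the resulting state."""
    if state is State.CLOSED:
        raise ValueError("the application has already exited")
    if action == "exit":
        # "Exit" quits only from the launch state; elsewhere it returns there.
        return State.CLOSED if state is State.LAUNCH else State.LAUNCH
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"{action!r} is not available in the {state.value} state") from None
```

Treating "exit" separately from the table reflects the rule that the button behaves the same way in every state other than the launch state.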
We implemented the device-based prototype on a Cingular 8125 mobile telephone. It includes a 1.3-megapixel camera, a 39-key alphanumeric keyboard, and a 320 x 240 color touch-sensitive display.
We developed the prototype's code in C# with the Windows Mobile 5.0 Smartphone QVGA Emulator of the Windows Mobile 5.0 SDK for Smartphone in Microsoft Visual Studio 2005.
The device permitted us in principle to implement the features of our design more realistically than previously used prototyping tools had. The development environment allowed us to place high-fidelity controls into the prototype with minimal coding effort.
However, there was a mismatch between the capabilities of the emulator and those of the device. The emulator did not provide any access to the device's camera, nor to the device's touch-sensitive display for input. Thus, using the emulator, we could not implement the "Capture" function and could not use any of our design's on-screen buttons for user actions. We were able to access the device's camera within the SDK, but all testing of that access had to be performed on the device itself. And even the SDK did not give us access to the touch-sensitive display. We had to replace the buttons with menus activated and navigated by hard buttons below the display. Although this prototype's tools are generally high-fidelity tools, in contrast to Denim, they were in fact lower in fidelity with respect to screen-based button input.
The development environment was also inefficient because it required the use of one particular operating system, which was not the most productive or available OS for all members of our team. A further inefficiency was the long time (5-10 minutes) the emulator required for initialization.
The device-based prototype has a launch state offering an exit button and a single menu with four options (Figure A1). Three of the options are kinds of input: a new camera photograph, an existing image, and a new typed text. The fourth option is to review already performed translations.
Of the launch-state menu options, the first three have been implemented.
When the user creates or selects an image in the camera state or the image-selection state, the application moves to a recognition state (Figure A5). The image is displayed above, with its language labeled, and its recognized text (currently a fixed text for each stored image) is displayed below. In this state, the user may in principle change the language label in order to correct the system's guess as to the language of the text, but the prototype does not yet handle this action. The user may immediately edit the recognized text to correct errors in it. When the user is ready to have the text translated, the user can press a "Translate" button.
When the user presses the "Translate" button, the application moves to a translation state (Figure A6). If the original is an image, it disappears and is replaced with the recognized text, including any changes made by the user. A translation appears below, labeled with its language. For image originals, the translations are currently fixed. For text-input originals, the prototype establishes a TCP session with a prespecified server and transmits the entered text to the server, where a human operator sees the text and replies with a translation, which is displayed on the device. In this state, the user may in principle change the source text's or target text's language, but the prototype does not yet handle these user actions.
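The text-input path amounts to a simple request and reply over TCP. The Python sketch below illustrates such an exchange under stated assumptions; the prototype itself was written in C#, and its wire format is not described here. The newline-terminated UTF-8 framing, the function names, and the loopback stub that stands in for the human operator are all hypothetical.

```python
import socket
import threading
from typing import Callable, Tuple


def request_translation(host: str, port: int, text: str, timeout: float = 60.0) -> str:
    """Send one line of source text and wait for one line of translation.

    Assumes the text contains no newline, since a newline ends the message.
    The generous default timeout allows for a human operator's typing time.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(text.encode("utf-8") + b"\n")
        with conn.makefile("r", encoding="utf-8", newline="\n") as reader:
            return reader.readline().rstrip("\n")


def start_stub_operator(reply_for: Callable[[str], str]) -> Tuple[int, threading.Thread]:
    """Stand-in for the human operator: answers one request on a loopback port."""
    listener = socket.create_server(("127.0.0.1", 0))  # port 0: pick a free port
    port = listener.getsockname()[1]

    def serve_once() -> None:
        with listener:
            conn, _ = listener.accept()
            with conn, conn.makefile("rw", encoding="utf-8", newline="\n") as stream:
                source = stream.readline().rstrip("\n")
                stream.write(reply_for(source) + "\n")
                stream.flush()

    thread = threading.Thread(target=serve_once, daemon=True)
    thread.start()
    return port, thread
```

For a quick local trial, `start_stub_operator(lambda s: "translated: " + s)` returns a port that `request_translation("127.0.0.1", port, "good morning")` can then contact.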

Figure A1. Prototype Launch State.

Figure A2. Prototype Camera State.

Figure A3. Prototype Image-Selection State.

Figure A4. Prototype Text-Input State.

Figure A5. Prototype Recognition State.

Figure A6. Prototype Translation State.