Parallelisation of Translatinor
The application that is being parallelised is called Transinlator and is currently in its infancy. Transinlator is being developed to aid in the creation of Translation Memories (TM) that play an important role in the process of localisation. More specifically Transinlator takes a .resx file (resx) and using existing TMs and services such as Google translate, finds translations for words and phrases found in the resx and writes these to an application specific TM. Notably, the target audience of the application are translators working in the field of software localisation and therefore the application has a graphical user interface. Furthermore, any parallelisation attempts will be designed to increase speed on standard desktop computers.
Current Architecture of the Application
Currently Transinlator is made up of ten classes. Four of these make up the user interface and direct the flow of processing during run time, three can be considered services that once instigated are called upon for processing throughout the runtime of the application and the remaining three when instigated are objects that represent XML documents and translated. Each of these classes are explained in more detail below followed by a description of the current flow.
MainWindow, SettingBox, JRichModel and AboutBox make up the classes responsible for the user interface. MainWindow is responsible for the overall flow control of the application. It instigates classes and makes calls to services and objects which respond to user commands. SettingWindow allows a user to view and update their settings that are stored in a properties file. JRichModel represents the model of the table in the application and defines which cells of the displayed table the user can edit and ensures they are correctly displayed. Finally, the AboutBox displays information about the application.
TranslationService, GoogleService and XMLOutputService make up the services within the application. TranslationService acts as a control flow to the various translation resources. It will check each of the resources for a translation before returning a translated string. GoogleService once instigated with two languages takes a String and retrieves a translation from Google in the target language. XMLOutputService writes to the application specific TMX document, hence it parses and transforms the existing TMX output document.
ResxDocument, BiLangPair and TranslatedTMXDocument make up the objects created during the run time of the application. On creation .Text fields within a resx document are parsed into a hash table making up the ResxDocument object. A BiLangPair object contains terms that have been translated and that are in need of translation and the TranslatedTMXDocument encapsulates a hashtable made of original and target strings from an existing TMX document that was used as input.
Normal Flow of the Application
The list below represents the usual steps taken by a user.
UI components are loaded
Settings are loaded from the properties file
Create new TMXOutputService (file is taken from the settings)
Select a resx file
Create a ResxDocument object (parse the resx file)
Create a BiLangPair object for the new ResxDocument
Adjust UI appropriately (enable/disable)
Create a new TranslationService (with default and target language)
Create TMXDocuments for each of the inputs and the output (Involves parsing an entire TMX document each time)
Create a GoogleTranlsateService
Run each string for the ResxDocument by the TranslationService to get the translations
Create a new JRichModel with the BiLangPair object to display them correctly
Translation confirmed or changed
Translation written to the TMXOutput
Involves parsing the document, writing the new elements and then transforming it
User interface updated
Potential for parallelism
Three areas of potential parallelism have been identified for this application. Each of the areas are independent of each other and may increase the application’s speed. After each implementation the speed increase or decrease will be measured to determine if the new changes speed up the application. To gain the optimal speed up, if the impact of new parallelism is not a gain, the changes to the application will not be kept.
Due to the nature of the application the expected gain will be much greater with larger datasets because the overhead will be better covered. The greatest potential for gain will be within the methods where the terms are translated as they require the most processing power. Below are the proposed areas of parallelism with potential challenges outlined for each.
Run definitions against multiple sources at the same time
Within the sequential implementation of the application, each word that is to be translated is compared to multiple sources including existing TMX files and external translation services. Currently, each word is compared against each source in sequence. These comparisons could be completed in parallel. Each TMX file service running in parallel to each other can access the list of terms to be translated and offer a translation.
To ensure overhead is not greater than the potential gain the algorithm should be written whereby there is sufficient granularity to cover it. Furthermore, because each term only has one translation there are possible race conditions whereby some translation inputs/services have precedence over each other. Both a message passing or shared memory model would lend themselves to this area of parallelism. Message passing would mean that a thread is only looking up definitions when needed however a shared memory model may be preferred. Each may increase speed ups depending on the datasets.
Concurrent threads for user interface and output file
At any point in time the application may be maintaining the user interface (responding to events), writing to an output file, processing data and reading input data. All of these processes could be run in parallel as parallel threads and may lead to a minor speed up. However, there are some dependencies that need to be considered. The output file requires the user interface, which controls the flow of the application, and the processed data to be in a certain state before it writes to the file. Furthermore, the file being written to is also read by the application. This will require inter-thread communication. Notably, because this application was originally developed using Swing, much of the user interface already exists on separate threads.
Break loops into smaller parts and run in parallel
Throughout the application there are loops that are used for rendering the user interface and performing the translations. These in most cases can be considered embarrassingly parallel however care will need to be taken to ensure that the granularity of the parallelism is sufficient.
Other notable changes that will be made
Due to the current nature of the application’s architecture it has a clear limitation in the amount of memory it uses. To improve speeds within the sequential application it only parses each document once. The documents are then stored in a hash map and drawn upon to find translations. Therefore, if too much data is consumed by the application an exception is thrown. The application is more valuable when it can utilise large data sets primarily because it is more likely to find a translation. Through a restructure of the application architecture this can be achieved, however the application will experience a slow down. As a result the application’s optimal sequential form will not be taken as its current form because the application is considered to be incomplete. The optimal sequential application will be taken as the fastest application that can parse at least 40MB of TMX files without throwing an exception.
As mentioned previously the most computationally intensive area of this application exists in the getTranslatedTerm() method within the class TranslationService. As a result it is within this method that significant speed up could be achieved.
Before modification, each word that needed to be translated was passed through if statements and sent to various services for translation when needed. A word was passed through many services each containing translations found from TMX documents until the translation was found. The translation was then added to a list of translations to be displayed to the user. The original syntax is below to outline the process.
To make the method more efficient through parallelisation each of these documents could be queried at the same time however, if each of the document services (importedHasTranslation and OutputTM) were on their own thread then a greater amount of communication between threads would be required. In Java this can be achieved using pipes (PipedInputStream and PipedOutputStream). Using this method of parallelisation would quickly become complex and may be difficult to code in a scalable way. Therefore, the decision was made to restructure the application’s architecture to promote simple, scalable, consistent and effective parallelism. Before entering the specifics of the language constructs used the re-written method is below and will be discussed at a high level.
Essentially parallelism was achieved by creating a class SyncBiLangPairs that encapsulates all of the terms to be translated and their respective pairs. This object was then passed to each of the threads and updated by the threads when necessary. The complexities and challenges of this task were more than initially anticipated and will be discussed in the next section in conjunction with the steps taken to improve and optimise its performance.
Java and Multi-threading
This section will outline the constructs used to map the parallelism in the application. Mapping an object to a new core was achieved by using the Thread object. Notably, Java is a relatively abstract language (when compared to C and other languages) and runs on top of the Java Virtual Machine (JVM). Because of this there is no way to assert that a thread is run on a separate core because scheduling of processes is done by a system’s operating system. However, most systems use priority- based round-robin scheduling; meaning that by setting a threads priority to MAX_PRIORITY will result in it running before threads of lower priorities.
Sharing objects between threads raises concerns of thread safety. Each of the threads may be attempting to write to the same BiLangPair at the same time which would result in race condition and possible data corruption. In this application a phrase that already has a translation in the existing translation memory may have been over-written by a translation retrieved from Google. To ensure this didn’t occur and multiple threads weren’t writing to the same object simultaneously the synchronized keyword is used before method declarations that are accessed by multiple threads. The keyword synchronized ensures that only one thread is accessing the object at any point in time by using monitors. When a thread accesses the method it is passed to the monitor and all other threads are blocked from the object. On completion the monitor is released.
For each of the service objects to be run on separate threads they implemented the runnable interface and its method run(). When the thread is started using the start() method the code in the method is invoked.
Steps Taken in the Parallelisation of Transinlator
Five distinct steps were taken in the parallelisation of Transinlator: restructuring the application to a shared object model, parallelising the parsing of the documents, parallelising the GoogleService, shifting to a work sharing model and optimisation.
This phase saw the creation of the class SyncBiLangPairs that encapsulated all of the information needed by the services. The TMXDocument(TMXDocumentActor) and GoogleService classes were modified to take the object as a parameter and their methods were re-written whereby the SyncBiLangPairs is iterated through, translations found where needed and written back to the object. No significant problems were encountered at this point however a great volume of code was modified.
Parallelising the TMXS
At this point the TMX services were modified to implement the runnable interface and the TranslationService modified to run each on its own thread. As a result of these changes some problems were encountered.
During the initial test the user interface was being displayed before each of the threads had completed and then each of the terms were being changed while being viewed. This was because there was an instruction in the code to wait for the threads and was overcome by using the Thread.join().
Secondly, the results being displayed were inconsistent because the Strings in the BiLangPair object were being written to in different orders based off which thread reached a phrase/word first. To combat this the TranslationState variable in the BiLangPair class was used to record whether the translation was needed from the particular service.
Parallelising the Google Service
Google service was parallelised in a similar way however, the thread is created and initialised before the TMX parsers because the service suffers from latency.
Enabling Work Sharing
At this point Transinlator was running in parallel with each document being run on separate threads, however this resulted in more threads needing the be created than cores on the processor and each thread varied in size. This resulted in some cores using all of their resources while others sat idle. In the diagram below note that Thread-8 and Thread-16 are created and perform very little work, there are 13 threads created on a two core processor (only 2 are run at a time, this achieved better speeds) and range of time ran between the threads is large.
To manage the threads more effectively a work sharing model was introduced whereby a given number of threads is created and each requested more work when it completed its current task. To create this model two new classes were written; TMXDocument and TMXQueue.
TMXDocument objects encapsulate all of the information needed by the services to be parsed. This included the path of the file, whether it has already been parsed by another service and the TranslationState that output words/phrases should be labelled with. Similarly to the BiLangPairs, TMXDocument objects will be shared across cores and therefore methods that modify the object were prefixed with synchronized.
A TMXQueue is a collection of TMXDocuments and contain methods that can retrieve a document that is yet to be parsed. When a service finished translating all of the phrases against a TMX document it requests another. Thereby each service will run until all of the TMX documents have been used to perform translation allowing to balance the work between threads and in turn the cores. Note in the diagram below each of the threads run for a similar amount of time and process larger granularity threads. This helps to reduce overhead and balance the load. This can also be considered a type of thread pooling.
Minor changeswere made to the code, some of which helped to significantly speed up the algorithm.
Firstly, using the .join() method meant that the translateTerms() method had to wait for one thread at a time to complete. If it was waiting for a thread that was taking more time than the remaining threads, the remaining threads would be unable to join until completion. To overcome this a CountDownLatch was implemented to record how many threads were still running. When each thread completed it runs the countDown() method. The counter method await() causes the thread to wait until the counter reaches 0.
Secondly, profiling indicated that the best results were achieved when the number of TMX document threads created was equal to the number of cores. Using the Runtime object Transinlator retrieves the number of cores and creates exactly that many threads to balance the load between.
In total, the parallelisation of Transinlator required the creation or editing of approximately 350 lines of code and the creation of 5 classes.
Profiling by definition is the investigation of a program’s behaviour using information gathered as the program executes. In this instance the information collected was how quickly the application runs through the translateTerms() method and how the work is being shared between the processors. To achieve the greatest speed up the work should be shared evenly between the cores using the most efficient code possible.
The speed and processor workload was measured at different stages of the development. Furthermore, it was also measured running different number threads, this data is used optimise the number threads generated ton ensure maximum speed up. The number of threads generated by the Java program directly relates to how they are assigned by the scheduler. To ensure a fair test, all of the profiling was run in conditions as close to identical as possible. This was achieved by ensuring that no other applications were running, testing each instance three times and calculating the average and performing the tests on the same machine.
The speed up through parallelisation is calculated using the best sequential program. The average time taken by the sequential program was 14444ms and 5638ms on a duel and quad core respectively while compared to 27117ms and 14406ms by the optimal sequential application. This means the total speed up was 1.88 running on a two core processor and 2.56 running on a quad core processor (see Figure 1).
Profiling and Threads
Profiling tools were used to track the number of threads being created and how long they lasted for. This process found that the optimal number of threads to be generating was that of the number of cores in the processor (see Figure 2). These statistics played a significant role in the optimisation phase of development.
CPU Load Balancing
Monitoring the degree of work being performed by each of the cores ensured the threads were being shared evenly. The graphs captured show what may be expected that when the process runs in the sequential program most of the work being performed at that point in time is on one core whereas, under threading the work is balanced evenly between the CPUs. The charts below show a quad core processing (with hyper-threading) and the work being done by each core for the application in sequential form and that of the parallel application. Graphs with different numbers of threads generated can be referenced in the appendices. These communicate the link between threads and cores.
The nature of this application means that it can be considered scalable parallelism such that as the number of input files increases so does the speed up you will gain from running it in parallel. Minor profiling results indicated that this was case. Times were taken with the application only parsing one file and compared with parsing 12 files. The results confirmed this hypothesis. While untested, it can also be assumed that if the application was used on a system with a greater number of cores that the speed up would increase assuming enough data is being processed to warrant the additional overhead.
TMXOutputDuring the parallelisation of this application the TMXOutputService has been placed on a separate thread. This area of parallelism has not been discussed in depth because the speed up achieved by doing this was too small to measure. When it is runs it enters a loop within which is the method for writing the current Document object to the given output file. The class also contains a synchronised method invoke() that is called from the main thread. Calling this method notifies the TMXwriter causing it to write the shared Document object.
Tools Used in Development and Profiling
Three key tools were used in the development of this application and further tools were considered and trialled. They include NetBeans, JProfiler, Windows Performance Manager and Scala. These tools were able to aide in all areas of the development and left nothing to be desired.
The original application was developed in NetBeans and therefore by using this IDE(Integrated Development Environment) the transition was seamless. NetBeans provided very useful debugging tools that operated between threads and a lightweight profiler. The profiler was helpful when initially generating the threads to ensure they were in fact being created and to monitor how long they worked.
A trial version of JProfiler was used to offer comprehensive profiling results. While such detail was not required to measure speed up it was able to confirm the times being generated by NetBeans were accurate and allowed source code to be drilled into by thread.
Windows Performance Manager
Windows Performance Manager was used to monitor the systems’ cores and their current workload. This aided in optimising the number of threads created to best balance the work amongst the cores. Furthermore, it was able to assert that the threads were being mapped to separate cores appropriately throughout the development process.
Scala was trialled as an alternative to standard. Despite the intention of Scala to simplify parallelism, moving the application to it would not have had great benefit because not enough code was going to be written to justify the overhead of the transition.
The project was difficult to start, however after beginning curiosity led its direction. By parallelising this application I have learnt the importance of profiling and optimisation in parallel development and the practical aspects of Java. My learnings extend beyond what has been discussed in this report; I undertook in depth research every time I implemented new multi-threading constructs to ensure I implemented them appropriately. I do nonetheless believe that by selecting a more sophisticated algorithm I could have better demonstrated my understanding of parallel programming constructs and algorithms, however I found locating an appropriate application difficult. Each application I found was either too large to parallelise in a semester or didn’t stand to gain a noticeable increase through parallelisation. In searching for a project I scoured SourceForge, GitHub, Google Code and CodePlex.
Despite my apparent short fall I have successfully parallelised the computationally intensive section of Transinlator. During this implementation I have taken into account the overhead generated by creating a thread, synchronisation issues, scalability and load balancing. Furthermore, in implementing the XMLWriter on a separate thread I have used Java’s inter-thread communication constructs (notify() and wait()). From this I believe that Transinlator in its current form is close to fully optimised for multiple processors with one exception:
If two files are being parsed on two processors and one is very large while relatively, the other is very small, the load wouldn’t be well balanced and one thread may sit idle while waiting for the other. Whilst this is highly circumstantial, it could occur. To better optimise the application in this instance the larger file should be split to allow better granularity in the load balancing. Furthermore, I believe the application could benefit from optimisation outside of parallelism. I didn’t do this due to the context of this project.
In completing this project I have also learnt a diverse range of Java specific libraries that can be used in parallelisation, such as those outlined in the code and about JavaThreadPools, the volatile keyword and much more. This is notable because a significant portion of the world’s applications are written in Java and therefore the ability to code parallelised programs in Java is important.
Secondly, to parallelise Transinlator I had to re-structure the code to enable shared objects. From this I have learnt the software architecture required to write a parallel application in Java. Given the constraints, in my opinion, Java would rarely be the best language to write parallel programs in. This is primarily because it exists at a relatively high level and therefore has less control over the mapping of threads to processors.
Finally, I have rarely used profilers in the past. By completing this project I have found them to be useful to follow the behaviour of an application. I have learnt how to use the profiler as a debugging tool and more importantly for optimisation.
In summary, through exploring parallel constructs and tools during the parallelisation of this assignment I have learnt how to successfully parallelise a Java application.