Disclaimer: This is a technical article intended for software developers. It’s full of techno-waffle, so if coding isn’t your thing, please feel free to read the rest of my blog where I go into astronomy, technology, gadgets and general geekery. You have been warned!
As a freelance software developer, I get all kinds of requests. I spend most days in my home office from morning till night designing and coding for Windows desktops, although occasionally a tasty project comes my way: I have been known to code interfaces for a wide variety of hardware, and sometimes even from, and for, a train, a boat or a plane.
But until now I had never been asked to develop Windows software for a country that doesn’t use the Latin alphabet, in this case Arabic-speaking Saudi Arabia. Compared to developing software for Latin-alphabet countries, Arabic presents a number of challenges. The characters don’t resemble anything I’ve ever seen and it’s read from right to left. The .NET data types handle Unicode strings without fuss, since strings are UTF-16 internally (hint: don’t use legacy 8-bit code pages to store non-Latin character sets), so the bulk of the task is getting all the various words and phrases migrated from English to the destination language, in this case Arabic.
In the old days, a Micro ISV wanting to translate one of their applications from one language to another would typically make a copy of the source and painstakingly step through it. It’s easier these days with modern development techniques and tools, but the end result is the same: the same product, localised. The trouble with Arabic-speaking countries is that not only do you have to translate the text strings but the digits used for numbers too. Oh, and did I mention that everything reads right to left, so forms conceptually work their way backwards too?
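To make the point about numbers concrete, here is a minimal Python sketch of the digit mapping involved. It assumes the target audience wants the Arabic-Indic digits (U+0660 to U+0669) rather than Western ones, which is a per-customer decision; the function name is made up for illustration, and in a real .NET application the framework’s culture support would be the place to look first.

```python
# Map ASCII digits 0-9 onto the Arabic-Indic digits U+0660..U+0669.
# Built from code points so the table cannot be mistyped.
ARABIC_INDIC = {ord(str(i)): chr(0x0660 + i) for i in range(10)}


def to_arabic_indic(text: str) -> str:
    """Replace Western digits with Arabic-Indic ones; other characters pass through."""
    return text.translate(ARABIC_INDIC)


# Hypothetical usage: only the digits change, the letters are left alone.
converted = to_arabic_indic("Invoice 2010")
print(converted)
```

Note that this only swaps the glyphs. Date formats, decimal separators and the like are a separate job.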
For this particular project, I’m creating a prototype from an existing Windows .NET application. A big application: one with over 400 form classes and well over 2,000 supporting classes. Clearly, putting together a quick prototype without affecting the original solution is going to take a long time unless some of it can be automated.
After creating a new copy of the solution (we’re not localising one solution dynamically, as the localisation involves substantial business logic changes), we want to be in a position where all hard coded strings have been extracted to XML resource (.resx) files. These can then be read by a translator’s third party tool and re-imported into the solution, with a different set of XML files depending on which language you’re using. If you just want to localise the strings (e.g. from English to French or Spanish) then the process is largely the same, but you’ll have to manage multiple language .resx files within a single solution and maintain the logic workflow for all regions on the same code base.
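Since a .resx file is just XML, the round trip to a translator’s tool is easy to picture. The following Python sketch reads the name/value pairs out of a cut-down, hand-written sample; a real .resx also carries a schema and header elements that are omitted here, and the control names are invented for the example.

```python
import xml.etree.ElementTree as ET

# A trimmed, hypothetical sample. Real .resx files contain extra header
# elements that this sketch simply ignores.
SAMPLE_RESX = """<root>
  <data name="btnSave.Text" xml:space="preserve">
    <value>Save</value>
  </data>
  <data name="lblName.Text" xml:space="preserve">
    <value>Customer name</value>
  </data>
</root>"""


def read_resx_strings(xml_text: str) -> dict:
    """Return {resource name: string value} for the plain string entries."""
    root = ET.fromstring(xml_text)
    strings = {}
    for data in root.findall("data"):
        # Entries with a type or mimetype attribute hold non-string data
        # (sizes, images and so on), so they are skipped.
        if data.get("type") or data.get("mimetype"):
            continue
        value = data.find("value")
        if value is not None and value.text is not None:
            strings[data.get("name")] = value.text
    return strings


strings = read_resx_strings(SAMPLE_RESX)
print(strings)
```

Anything a translator’s tool needs can be built from a dictionary like this, and written back the same way.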
The ‘trick’ of the trade is first to set the “Localizable” property on all of the forms to “True”. This can be accomplished with a Visual Studio macro that loops through the forms and forces the Visual Studio designer to refresh itself. Visual Studio will automatically move all localisable hard coded strings from the controls in a form’s designer class (e.g. MyForm.Designer.cs) to the form’s own resource file (e.g. MyForm.resx). If you have custom controls with custom string properties, remember to add the <Localizable(True)> attribute ([Localizable(true)] in C#) to those properties first so that the designer knows to extract their hard coded strings to the resource file, too.
At this point we have our unmodified English version and our new “Arabic” version, which has all of the form UI text in resource files but is fundamentally the same solution. The next step is to extract all of the hard coded strings from the code behind your forms, placing each string in a resource file and leaving a reference to the resource in its place in the code.
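As a rough illustration of what that refactoring amounts to, here is a small Python sketch that pulls simple double-quoted literals out of a line of code and leaves a resource reference behind. The key naming scheme (Str1, Str2, ...) and the `Resources.` accessor are assumptions made for the example, and the pattern deliberately ignores escaped quotes and verbatim strings, so treat it as a sketch rather than a working extractor.

```python
import re

# Matches a simple double-quoted literal with no embedded quotes or escapes.
LITERAL = re.compile(r'"([^"\\]*)"')


def extract_literals(line: str, resources: dict, prefix: str = "Str") -> str:
    """Replace each literal in `line` with a resource reference.

    New strings are added to `resources`; repeated strings reuse their key.
    """
    def swap(match):
        text = match.group(1)
        if not text:
            return match.group(0)  # leave empty strings alone
        for key, value in resources.items():
            if value == text:
                return f"Resources.{key}"
        key = f"{prefix}{len(resources) + 1}"
        resources[key] = text
        return f"Resources.{key}"

    return LITERAL.sub(swap, line)


resources = {}
rewritten = extract_literals('MessageBox.Show("File saved");', resources)
print(rewritten)   # MessageBox.Show(Resources.Str1);
print(resources)   # {'Str1': 'File saved'}
```

The hard part in practice is not the substitution but deciding which literals should never be touched.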
This is the time consuming bit. Done by hand, you’ll be highlighting, cutting, copying and pasting a lot, and will likely make errors. Whilst Visual Studio has a nice automatic procedure for the first step (so automatic that you may not even notice any change at all), there’s seemingly nothing built in to handle this simple refactoring job. There is, however, a Visual Studio addon that was written for Visual Studio 2008 called “Resource Refactoring Tool”. It works with Visual Studio 2010 and gives you a right click context menu option called “Extract to Resource”. Simply highlight the string, right click and then select “Extract to Resource”. It will be done for you, but you still have to wade through, in this case, 400 complex forms, which between them can easily contain tens of thousands of strings awaiting extraction.
If I extracted one string every ten seconds, that’s looking like an entire week of effort, and that’s without stopping to breathe.
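As a sanity check on that estimate, the arithmetic is simple enough to sketch. The figure of 20,000 strings below is an assumption standing in for “tens of thousands”; only the ten seconds per string comes from the text.

```python
def extraction_hours(string_count: int, seconds_per_string: float = 10.0) -> float:
    """Hours of non-stop work needed to extract every string by hand."""
    return string_count * seconds_per_string / 3600


hours = extraction_hours(20_000)   # 20,000 strings is an assumed figure
print(round(hours, 1))             # 55.6
print(round(hours / 8, 1))         # 6.9 (eight-hour working days)
```

So roughly 55 hours of uninterrupted clicking, or about seven working days, before a single word has been translated.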
So I went looking for a tool that would loop through a given set of classes and extract strings from them into a resource file. I checked everywhere and found only one viable offering from a company known as Lingobit. Their flagship product, “Lingobit Localizer” claims to be a one stop shop for localizing software products and I dare say that it looks good. It’s also very expensive, costing about the same as Adobe Creative Suite 5. They’ve recently released a little product which doesn’t claim to do any translating but it does fit the gap for what is essentially a glorified text to XML parser. And did I mention that it’s £200?
Maybe calling it a glorified text parser is being a little unfair since one could say the same thing about Visual Studio, although a basic solution text parser that allows for filtering of results and some form of XML-compatible export routine is a big omission from Visual Studio and there aren’t any other addons or tools that provide this functionality. Both ReSharper and CodeRush come close with their refactoring tools but aren’t powerful enough to insert a new string resource and refactor more than a single line of code at a time.
Basically, Lingobit Extractor has a very simple interface. You create a new Extractor Project, after which you load the solution (it works for many different programming languages and not just managed code; as I said earlier, it’s a text parser) and then write your filters. Filters are a convenient way of searching for strings and you can have more than one filter per project. In fact you need more than one filter, and the effects are cumulative. For instance, I wanted to exclude all SQL and reserved names from being extracted, as extracting them would break the solution and prevent it from compiling. Out of the box, this process is a little frustrating: the application really should provide some basic ready-made filters depending on the type of solution loaded.
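The idea of cumulative filters can be sketched in a few lines of Python. This is not how Lingobit Extractor implements them (its filter format isn’t shown here); the rules, the reserved-name list and the sample strings are all made up to illustrate the principle that a string is only extracted if it passes every filter.

```python
import re

# Crude SQL detector: the string starts with a common SQL verb.
# It would also reject innocent text such as "Update your profile".
SQL_START = re.compile(r"^\s*(SELECT|INSERT|UPDATE|DELETE)\b", re.IGNORECASE)

# Example reserved names; a real list would be much longer.
RESERVED = {"Error", "And", "Date"}


def not_sql(text: str) -> bool:
    return not SQL_START.match(text)


def not_reserved(text: str) -> bool:
    return text not in RESERVED


def has_letters(text: str) -> bool:
    return any(ch.isalpha() for ch in text)


FILTERS = [not_sql, not_reserved, has_letters]


def keep(text: str, filters=FILTERS) -> bool:
    """A string survives only if every filter accepts it."""
    return all(f(text) for f in filters)


candidates = ["Save changes?", "SELECT * FROM Orders", "Date", "123", "Customer name"]
translatable = [c for c in candidates if keep(c)]
print(translatable)   # ['Save changes?', 'Customer name']
```

Each extra filter can only shrink the list, which is why getting the set right matters more than getting any single rule perfect.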
Getting the string filtering right took a couple of hours across all the various filters, but it was time well spent as it reduced the number of unwanted strings in the list prior to exporting to the resource file.
After you’ve done this, it’s a simple case of selecting the projects and classes that you wish to translate in the left navigation pane, extracting the strings to a temporary editable table and then, if you’re happy, exporting the results to a new or existing resource file. The naming of the strings is fully controllable, and the filters are easy to use and flexible. And no, there are no regular expressions in sight, although they are supported if you want to use them.
Once your source files have been loaded and your filter has been configured correctly, you execute the parser and view the newly created resources. It’ll show you a preview of your source (top right) and the newly named string references with their values (bottom right), and you can save the referenced strings into either an existing or a new resource file. Splendid. Just make sure that you’ve spent sufficient time at the string extraction stage to ensure that you’re not translating any, say, SQL statements. I found that even some source code specific keywords ended up getting parsed, which broke compilation, so be careful and check everything.
The next step is to send the .resx files over to your translator. Most translators will accept them. For those who can’t, say because they’re simply native language speakers without the tools for editing XML files, you can use a tool such as TransView, which has a (free) viewer and a (paid) Visual Studio addon that parses through your projects and combines the resource strings into a single proprietary file ready for translation. Your translator then has the very simple job of filling in the boxes. It even includes tools for translating over the web (thanks to Google Translate) and for auto-filling duplicates, so you’re not translating “OK” for the thousandth time.
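The duplicate auto-fill idea is worth a quick sketch, since it is what keeps the translator’s workload sane. The Python below is not TransView’s implementation, just an illustration under assumed data: the file names, control names and the single Arabic translation are examples only.

```python
def unique_phrases(resource_files: dict) -> list:
    """Collect each distinct string once, in first-seen order."""
    seen = set()
    ordered = []
    for strings in resource_files.values():
        for text in strings.values():
            if text not in seen:
                seen.add(text)
                ordered.append(text)
    return ordered


def fill_translations(resource_files: dict, translations: dict) -> dict:
    """Apply one translation per phrase to every place it occurs.

    Phrases without a translation yet are left in the source language.
    """
    return {
        file_name: {key: translations.get(text, text) for key, text in strings.items()}
        for file_name, strings in resource_files.items()
    }


# Hypothetical resource files keyed by file name.
forms = {
    "LoginForm.resx": {"btnOk.Text": "OK", "btnCancel.Text": "Cancel"},
    "OrderForm.resx": {"btnOk.Text": "OK"},
}

to_translate = unique_phrases(forms)
print(to_translate)   # ['OK', 'Cancel']

translated = fill_translations(forms, {"OK": "موافق"})
print(translated["OrderForm.resx"]["btnOk.Text"])
```

With over 400 forms, the number of distinct phrases is likely to be far smaller than the number of strings, which is the whole point.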
Anyhow, I quite liked Lingobit Extractor. It did the job for me, but its very existence raises the question of why these string extraction refactoring features aren’t available within the Visual Studio IDE, or even in the two main developer productivity tools, CodeRush and ReSharper.
What I liked: It parses fast. If The Flash could parse files, he’d parse them this quickly. A solution with over 500 forms was parsed in seconds.
Annoyances: Selecting multiple resources and choosing to delete only deletes one, not all of the selected items. Some things like #Region aren’t supported, and neither are the default values for optional parameters in method declarations. This means that you have to filter these sections out. The trouble is, the filter properties are stored in the tool’s project file and cannot be shared between projects, which is a bit of a problem. The project file is an XML file, though, so it’s not hard to make a ‘template’ by copying out the <Filter> elements if you need to copy them from one project to another. The support URL was throwing an HTTP 500 at the time of writing, which wasn’t helpful. Also, any reserved name strings like “Error”, “And” or “Date” will by default create escaped versions in the resource file (e.g. “_Error”), but the escaped version isn’t used in the refactored code, so it’s a pain going through and changing them all by hand afterwards. Finally, after a few hours I noticed that all my .resx files had “TRIAL” written into the strings, which was a bit of a problem.
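Copying the <Filter> elements from one project file to another can be scripted rather than done by hand. The Python sketch below assumes nothing about the real project file beyond what is stated above (it is XML and contains <Filter> elements); the surrounding element names and attributes in the sample are invented, and appending to the root element is a guess at where the tool expects to find them.

```python
import copy
import xml.etree.ElementTree as ET


def copy_filters(source_xml: str, target_xml: str) -> str:
    """Append a copy of every <Filter> element in the source to the target root."""
    source = ET.fromstring(source_xml)
    target = ET.fromstring(target_xml)
    for flt in source.iter("Filter"):
        target.append(copy.deepcopy(flt))
    return ET.tostring(target, encoding="unicode")


# Hypothetical project files: the element and attribute names are made up.
template = '<Project><Filter name="NoSql"/><Filter name="NoReserved"/></Project>'
new_project = '<Project><Source path="MyApp.sln"/></Project>'

merged = copy_filters(template, new_project)
print(merged)
```

Keep a backup of the target project file before trying anything like this on the real thing.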
If you can live with the annoyances and have lots of classes to parse then I highly recommend Lingobit Extractor and, optionally, TransView.
That’s all for now, folks.