Kirsty’s response to Question 1: Big (?) Data and ML

The world of data is, at first glance, an unfamiliar one for those of us who make our living from literary and cultural representations. We are trained – and we train our students – to ferret out nuance and connotation, to read between the lines or beyond the page, to find the multiple meanings surging around a simple word like ‘home’ or ‘nation’ or ‘language’. And Modern Linguists, like Ginger Rogers, do all this backwards and in high heels – or at least, in multiple linguistic, geographical and cultural contexts.

In the world of data, of course, our tried and tested strategies of interpretation do not wash. Trying to impute nuance, connotation and multiple meanings to a spreadsheet is a pointless task, rather as if your precious data is at the mercy of a translator who understands only one language and doesn’t get nuance. A computer will do exactly what you tell it to do, and only when you tell it using the one expression it has been programmed to understand (no stray punctuation and definitely no connotation).

But let’s not overestimate the problems. In fact, once you get past the initial encounter (awkward first data?) and see things from the computer’s point of view, much about working with data plays to our strengths as Modern Languages researchers. They are programming languages, after all, each with its associated social, cultural and pragmatic milieu. You could even say that Modern Linguist vs XML or SQL or [insert your programming language of choice] is the ultimate intercultural encounter.

In all seriousness, Modern Languages researchers not only have much to gain from data-driven humanities projects, but we also bring a very particular array of skills to the table. We are ideally placed to develop a reflective, intercultural approach to digital/digitized data and the tools that allow it to be captured, stored, curated, shared, analyzed and transformed. We need to make our case.

Gathering data – qualitative, quantitative, numerical, categorical, bibliographical, biographical, topographical, you name it – is just the beginning of the process, and if we lack the technical tools to transform it into something else, well, that’s what collaboration is for (and that’s a Good Thing, by the way). But once the data is gathered and transformed, and ready for meaningful engagement, that’s when our expertise comes into play.

As Modern Languages researchers, we can combine our proficiency in representation, its nuances and connotations with our ability to consider the commonalities and differences of engagement with digital/digitized data and tools across cultures and languages. Out on the global web, data-driven projects and tools such as crowdsourcing, community archives, emotional geographies, or genealogical databases provide unprecedented opportunities to leverage the digital as a means of stimulating investment and even participation in Modern Languages research by individuals and communities who would never, even for a second, regard themselves as modern linguists. Let’s grab them!



3 thoughts on “Kirsty’s response to Question 1: Big (?) Data and ML”

  1. You make important points about how linguists are in a good position to become DHers and build bridges between different cultures of knowledge. Data, big or small, is a word often associated with problems of processing and storage that require technical know-how and specialist programmes to deal with. Do we need to query the word data and consider the forms in which we are already proficient in considering it? If data is simply a collection of facts, why is it such an alienating word for ML specialists?


  2. As well as loving the Ginger Rogers metaphor (the first time to my knowledge that us Modern Linguists have been compared to a glamorous, all-singing, all-dancing movie star…), I particularly like what Kirsty has to say here about the advantages for all of us in taking into consideration ‘big data’ approaches. It certainly is true that our training in close textual analysis does seem, at first glance, at odds with data-driven approaches and the manipulation of spreadsheets. But, as Kirsty says, there is much to be gained, and it’s not a case of us leaving our close analysis and nuanced understanding behind.
    As I read what Kirsty says, I was reminded of some pieces I’ve read recently on Tim Hitchcock’s excellent blog about big data approaches in the discipline of history. Tim argues that the best uses of big data approaches are when they are complementary to close textual reading, allowing us to do both a ‘distant reading’ in the context of 127 million words and a close reading, seeing particular case studies in their geographical and social context. Kirsty’s arguments seem to chime with what Tim is saying in a different disciplinary context, and I’m sure there is a fruitful dialogue to be had there as ML continues to make its way in the big data debates.


  3. Thank you both for your thoughtful responses. Niamh, I absolutely agree about challenging our fear of the term ‘data’. As the opening paragraph of my University’s research data management policy puts it,

    ‘All researchers produce ‘data’ in the course of their projects and investigations [I like the scare quotes!]. Without research data there is nothing to base research outputs on and more and more the data produced by a project can be seen as a research output in and of itself. All researchers are used to handling research data and disciplines have, over time, developed best practices in dealing with research data – be that data from a scientific instrument, e-lab notebooks, audio files of participant interviews, text transcripts or images from a gallery’ (1).

    It is a question of demystifying the concept, that’s true, but it’s also a question of complicating it right back up again. I said above that there’s a seeming conflict between the nuances and connotations of textual interpretation, and the ‘spade’s a spade’ language necessary for turning text into data. What I didn’t say, and should have done, is that the process of turning text or other information into ‘data’ is as complex as any translation, perhaps more so.

    Here’s a data story: I collected biographical information about a group of several thousand people from the Luso-Hispanic world who settled in Liverpool in the nineteenth century. My people were from more than twenty countries and spoke six or more languages between them. Some were literate, some were not. What they all had in common was that they came to Liverpool and, one day in 1871 or 1881 or 1891, a census enumerator knocked at their door and shoved a census form into their hand. The information they (or a literate friend or neighbour) put on that form was collected in, transcribed by a harassed and probably monolingual clerk, and stored in a big book in London. A hundred years later, it was digitised and transcribed again by a harassed and probably bilingual technician in India. And then it was put online and transcribed once again by a harassed and more or less trilingual academic in the UK (me) and put in an Excel spreadsheet. After that, I gave my spreadsheet to a harassed academic technologist fluent in English and several programming languages, and asked him to turn it into a database. Which we almost have.

    At every point at which the data passed from one form to another, multiple decisions had to be made. Some were at the granular level: how to transcribe an unknown Basque, or Filipino, or Spanish surname, or whether ‘Lisbon, Spain’ is a factual error or an insight into somebody’s worldview. Others were at the level of ontology: what names, definitions, relations and categories of data will form the building blocks of our new body of knowledge? Manipulating data is, in a very real sense, an exercise in nuance. It requires a solid understanding of the historical, geographical, material and linguistic context in which that data was generated, a clear idea of the contexts in which the data can or might be used, and sustained reflection on each step of the process of turning it into something else. Like Claire, I found Tim Hitchcock’s reflections on data and the uses of the past extremely compelling, in particular his argument that digital tools such as ‘nominal record linkage, building on a generation of work undertaken by family historians, should allow us to tie up and re-conceptualise the stuff of the dead, as lives available to write about’.(2)


    (2) Tim Hitchcock,

