How librarians support TDM in the research environment


I presented on the topic of how librarians can support text and data mining (TDM) in the research environment at the recent Text and Data Mining Symposium organised by the University of Cambridge Office of Scholarly Communication on 12th July. For anyone watching the livestream, I also gave a speech about ethics in librarianship during the panel discussion at the end so I hope people enjoyed that!

Without any further ado, here is a written up version of my talk with bonus slides for anyone who is interested.

I presented the talk with my colleague Yvonne Nobis who facilitated the Q&A session at the end. I started by introducing the slightly bizarre building within which we both work and provide services to the research community here at Cambridge.


This is the Betty & Gordon Moore Library which is the main University Library’s STEM collection for the entire University. We cover lots of different subjects as well as being the departmental library for the Faculty of Maths, so needless to say it can all get a bit complicated at times!

We’re probably going to get a lot of these today but according to the Intellectual Property Office, TDM is: “use of automated analytical techniques to analyse text and data for patterns, trends and other useful information.”

But what is useful information and more to the point, what is the usefulness of TDM?
In his report for the Publishing Research Consortium (Text mining and scholarly publishing, 2013), Jonathan Clark says that TDM can be used to do several things:

Enrich content – mining can improve indexing, be deployed to create relevant links, to improve the reading experience.


Conduct systematic reviews of literature – mining can help researchers systematically review larger bodies of content, faster than they could do it themselves and to keep up with their field without missing relevant information.

Discover – mining can be used to create databases that can themselves be mined.


Carry out Computational Linguistics research – mining itself is the subject of research, for example to improve the extraction of meaning from texts.

All of these are useful in the HE environment, but more often than not we will be dealing with the use of TDM to conduct systematic reviews of the literature.

So what is the TDM workflow for researchers and where do librarians fit in? This workflow was taken from a presentation given by Ann Okerson at IFLA (IFLA WLIC 2013) and it is a simplified workflow. Many of these points will have more detailed subsections, and researchers might do them in different orders or double back on themselves as the project progresses. But it gives a good overview of the process in a nutshell:
1. Identify questions that need to be asked
2. Identify suitable sources to be mined
3. Access resources
4. Download resources to local host
5. Create programming to ask questions
6. Analyse and interpret the results

So where do librarians fit in on this workflow? Librarians have a role to play at every single stage of this process.

Identify questions that need to be asked
This is basically a reference interview. The bread and butter of what we do. Through talking to researchers and understanding what they’re working on, we can help them identify what they’re trying to ask and what the best keyword combinations might be to get the best results.


Identify suitable sources to be mined
Again, part of our existing repertoire. We can advise on available services as well as on the licensing of those services, including any potential risks of triggering DRM or other “unusual activity” issues. We can also facilitate any conversations needed between our researchers and the publishers whose resources they wish to mine.

Access resources
Much of this is covered in the last point but we can help facilitate access to whatever researchers need.

Download resources to local host
We can work with researchers to ensure that the necessary IT support is in place by connecting them with the relevant services as appropriate. We can also advise on appropriate research data management to help researchers manage the information once they have it.

Create programming to ask questions
While there are many researchers out there doing fun stuff with code, many do not have the skills to start TDM on their own. So it is up to us to help facilitate their learning whether that is through delivering our own sessions on how to write code or getting people into our spaces who do know to deliver sessions for our users.

We did this recently by inviting some researchers in to deliver a session on GitHub for our users. We would not have been able to deliver this session ourselves but we could provide the space and support to help such a session run and it was very successful.

We can also promote existing tools, APIs, and platforms like GitHub or local companies like ContentMine.
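
For anyone wondering what “programming to ask questions” can look like at its very simplest, here is a minimal Python sketch. It assumes the texts have already been legitimately downloaded; the sample documents, file names and search terms are all invented for illustration, and a real project would need far more than this.

```python
import re
from collections import Counter


def count_terms(documents, terms):
    """Count how many documents mention each term (case-insensitive, whole words)."""
    counts = Counter()
    for text in documents.values():
        # Break the text into a set of lowercase words.
        words = set(re.findall(r"[a-z]+", text.lower()))
        for term in terms:
            if term.lower() in words:
                counts[term] += 1
    return counts


# Hypothetical sample texts standing in for downloaded articles.
sample_docs = {
    "paper1.txt": "Graphene shows promise for battery electrodes.",
    "paper2.txt": "We review battery degradation in electric vehicles.",
    "paper3.txt": "A study of graphene synthesis methods.",
}

search_terms = ["graphene", "battery", "perovskite"]
results = count_terms(sample_docs, search_terms)
for term in search_terms:
    print(term, results[term])
```

Even a toy like this maps onto the workflow: the question is “which papers mention these terms?”, the dictionary stands in for the downloaded sources, and the printed counts are the results to interpret.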

Analyse and interpret the results
Again we can offer training and work closely with researchers on this analysis and interpretation to help them carry out this work using our librarian skills. We can also support by connecting researchers with colleagues working in similar areas so they can pool their knowledge and resources, or simply connect people with resources that can guide them through the interpretation process.

You might notice that there is a pattern forming here. Librarians are in a really good position to help facilitate so many aspects of the TDM process, and while we may know how to do some of the activities required ourselves, this isn’t critical. More importantly, we can connect people with resources and other people who do know things. We can facilitate so many of these processes without always needing to have the core skills and knowledge ourselves (even though that does help!).

Librarians fully supporting TDM
But to be able to fully support our researchers we need to be doing several things as a profession. Reilly (2012) said we “need to fully understand practices of users and integrate that into licencing and research development work.”

As my Engineering Library colleague Kirsten Lamb said when talking about her work as an embedded librarian with a research group here, asking a researcher “how can I help” puts the onus on them to identify their own needs which isn’t the right way to do things. Instead start by asking questions such as “what are you working on right now?” and take the conversation from there.

And with engagement in mind, how many of us are promoting to our users that in 2014, amendments to UK copyright legislation following the Hargreaves review of IP law mean that researchers are legally allowed to perform TDM for non-commercial reasons, providing they have legitimate access? By legitimate I mean, are they logged into stuff with a valid Raven/authentication ID? Then this is legitimate.

How many of us are working with our researchers to understand what they’re doing and how we can help them with TDM and other areas? How many of them even know that we can?

And possibly the slightly thorny question – how many of us are comfortable with putting ourselves out there as people who can support TDM? These services will require some upskilling and awareness raising on the part of library teams, but the fundamental skills are already there; we just need to adapt them to a different context.

We can develop this support and even the upskilling part too through building relationships with our immediate research communities. We can learn from them while they learn from us, creating a fantastic collaborative partnership where both parties benefit from each other’s expertise. We can also provide services that hopefully reduce the high entry level required by some aspects of TDM, especially by applying our own experiences of understanding this potentially new topic directly to our teaching.


Starting with something simple, we can offer guidance through online training and resources.
One good example of this is this shiny and new Cambridge TDM LibGuide. This was developed through collaborating with multiple library departments across the University and was led by the eresources team at the main University Library.

This screengrab features a YouTube video that I made on TDM to give a light overview of what is a complex topic. It also has a helpful section on what various database providers allow and don’t allow so people can be prepared for what they can expect if they need to mine a certain resource and how that will pan out for them.

However, having these online resources does not absolve us of responsibility with regard to offering ongoing support. It is just the beginning and we need to have localised expertise and support to back up this sort of content.
Thanks to this excellent article by Jane Secker (and others) for this point. We need to integrate TDM into our existing copyright teaching and cover other related topics such as Creative Commons licences and data protection law (TDM is still bound by data protection rules). We can also advise on next steps, such as contacting the IPO to report a publisher that is blocking TDM it should be allowing, for example through excessive DRM technologies.

We can also advocate for our users. We can encourage and support them in pushing the limits of what publishers will allow, or are legally bound to allow, with TDM. We need to be bold here and not be cowed by the more powerful providers that we have to engage with. Also, we are in a helpful position where we can raise awareness of our campus needs with providers. We can encourage publishers to support licensing and research support development by offering to work collaboratively with them to co-develop licence principles and services that reflect what we are seeing on the ground. We can facilitate such discussions, represent a variety of viewpoints and help build collaboration across sectors.

But as has been our experience in the past, not all publishers are willing to work with us, which is why we then need to pressure them as customers to provide the services we expect for the often not inconsiderable amounts of money we pay.

With that in mind, we also have a responsibility not to sign service contracts that hinder TDM, and to demand that contracts actually acknowledge TDM, as some are still quite vague on the issue, if they mention it at all.

While the 2014 law changes make many contract stipulations unenforceable, publishers can still put measures in place to “reasonably restrict” TDM activities to protect the stability and security of their platforms. What classes as reasonable has not yet been defined in case law and circumventing any existing restrictions is still illegal. But as I have already said, as customers we are in a position to work with (and sometimes against) publishers in getting the best deal for our users.

Speaking of working with providers, as it currently stands, librarians are key in getting access to databases set up and being intermediaries when access is denied, so we are in a position to see accessibility issues with these resources from many different angles and at many different stages.

Here’s the techy bit

Librarians provide vendors with a range of IP addresses that cover their campuses and networks, which is where the access requests will come from. These are considered to be on campus. Authentication systems like Raven/EZProxy route users with off-campus IP addresses through a single IP within the range, even if the person is logging on in New Zealand. This IP address is highlighted when notifying vendors of the IP address range as there is often a greater amount of user traffic via this single IP.
Some publishers may place technical protection measures on their online content which are triggered when “unusual activity” is spotted.

This activity can be benign or malign, and it isn’t always immediately obvious which. When the measures are triggered, access is blocked for a single IP or networked cluster of computers. If the IP concerned is the off-campus/Raven IP then this will block access for all off-campus users. Access is often restored efficiently but can sometimes involve extensive investigation, especially in more serious breaches such as someone’s credentials being used without their knowledge. This process also involves extensive contact with the vendors themselves on behalf of the University.
(Text provided in part by colleagues at University Library who deal with this on a daily basis!)
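
To make the single-IP point a bit more concrete, here is a small illustrative Python sketch. It is a simplified model rather than a description of how Raven or EZProxy work internally: the address ranges are reserved documentation addresses (not any real campus range), and the function names and example users are hypothetical.

```python
import ipaddress

# Hypothetical campus range and proxy address (reserved documentation addresses).
CAMPUS_RANGES = [ipaddress.ip_network("192.0.2.0/24")]
PROXY_IP = ipaddress.ip_address("192.0.2.10")


def ip_seen_by_vendor(user_ip):
    """Return the address a vendor would see for a user in this simplified model."""
    addr = ipaddress.ip_address(user_ip)
    if any(addr in net for net in CAMPUS_RANGES):
        return addr      # on campus: the vendor sees the user's own address
    return PROXY_IP      # off campus: traffic is routed through the single proxy IP


def affected_users(users, blocked_ip):
    """List the users who lose access when the vendor blocks one IP address."""
    blocked = ipaddress.ip_address(blocked_ip)
    return [name for name, ip in users.items() if ip_seen_by_vendor(ip) == blocked]


# Hypothetical users: one on campus, two off campus.
users = {
    "on_campus_reader": "192.0.2.55",
    "home_reader": "198.51.100.7",
    "reader_in_new_zealand": "203.0.113.42",
}

print(affected_users(users, "192.0.2.10"))
# → ['home_reader', 'reader_in_new_zealand']
```

Blocking the one proxy address locks out every off-campus user at once, which is why a single researcher’s heavy downloading off campus can have such wide consequences.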

So TL;DR – it’s still really complicated!

We facilitate access while also being bound by terms and conditions. We try to provide access to all, yet facilitating certain levels of access might trigger events which then restrict access. As librarians, we’re in a really difficult position at times.
We are ethically obligated as a profession to ensure access to knowledge is as open and as easy as possible while also having to address the more restrictive elements of that knowledge access. Good institutional support to tackle these issues is critical.

So looking to the future… currently TDM requests, either for support from us or direct to publishers, are still relatively low, but I would argue that this is in part due to us not promoting TDM as an option and offering the relevant support. The more we do, the more demand will grow, and publishers will have to engage with us to develop solutions that ensure quick, easy and secure uses of their resources.

The future of TDM depends on a radical shift in the current publishing process and librarians must be a part of that shift, and even lead from the front. Open Access and other such movements are just the beginning.

Thank you.

(Any comments or questions, leave them below!)

All images CC0 from Pixabay (with exception of Moore Library image)
