Tuesday 22 March 2016

Too Much Information? Capita's IT crash continues ...


Capita's IT crash: don't panic, all is in hand. No: actually, do. It's not.

In the post published ten days ago, Mrs Angry told you all about the massive IT failure that was affecting Barnet Libraries: the failure of a system that is supposed to be managed by Capita, as part of our ten year contractual agreement. A failure that still has the library service in a state of virtual paralysis, and will continue to do so until next month, at the earliest. 

In truth, even if the system, or what is left of it, or perhaps a new system, is in place soon, the council now admits it will take as long as six months to recover from the damage done, or as they put it 'to fully populate the gaps in the data' - a typically fatuous Barnet corporate phrase meaning, if you need Mrs Angry to translate: 

'to try to recreate all the data - such as membership details, details of book stock, all issues and returns of stock, all catalogue entries, or any other transaction - which may have been lost in the two year period during which Capita have been charged with the responsibility for the upkeep of the library management system'.

An enormous task, as you might imagine. Especially if it will have to be undertaken just after Barnet Council has sacked 46% of library staff.

The timing of this system crash is hugely significant, coming as it does just before two important meetings, at which our Tory councillors were expected to approve the catastrophic round of cuts which will reduce our library service virtually to one in name only, with a loss of half of all staff jobs, disguised, they would have us believe, by the adoption of a pioneering use of 'open' or unstaffed libraries.

Since the crash, the pilot scheme at Edgware library has of course been out of action. So there are huge implications in terms of new levels of risk for a borough wide use of what the council coyly refers to as 'technology enabled libraries', TEO libraries that have NO members of staff on the premises, and must be unlocked by members of the public with a pin number.

On Wednesday this week there is a crucial meeting of the CELS committee: the Children, Education and Libraries committee, at which the devastating Tory library cuts, which include the pioneering use of highly controversial unstaffed libraries, will be put before members for approval. It is likely that this decision will be referred to the next Full Council, but ... who knows?

The reports for the CELS meeting were published - rather late - last week; 600 pages with little change from the terrible 'options' that went out for nonsultation with the few members of the public who knew about it.

Small concessions, probably from fear of legal challenge, on the unstaffed libraries, mean that CCTV will now be live, rather than recorded, which will be of great comfort to you if you are mugged or assaulted in an empty library - unless of course your assailant has the presence of mind to launch their attack out of camera shot, behind the shelves. 

We are told that there will be times when the live CCTV will not be available, so security guards will be provided, at enormous cost, as they were in the pilot scheme, thus making it not a pilot scheme for what was planned, that is to say unstaffed libraries without any security guards.

The fact that the council is willing to spend money on security rather than library staff, in contradiction of the pretence of imposing this regime in order to make savings, is very interesting, is it not? But then the supposedly cash strapped council is happy as usual to splurge money on the ideological obsession with outsourcing, and the rejection of responsibility for providing council services, by throwing £6 million of our taxes at the alterations that their library plans will require. To save £2 million. Another example of 'easynomics'.

The only other real concession is that - big deal - rather than all children under the age of 16 being banned from these libraries, hello: now it will be all children under the age of 15.



Now do try children, to read as many books as you can, won't you, before those beastly Tory councillors shut the doors and ban you from the new 'open' libraries ... ?

Despite protests about the impact on elderly, pregnant or disabled users, public toilets in libraries will still not be available in unstaffed libraries.

The report going to CELS was published incomplete, with a missing Appendix L. 

That was a serious omission, of course, as Appendix L addresses the Capita library system failure, a development which puts the rest of the report, and the hugely significant decision under discussion, into question.

Only after complaints from library campaigners came the late publication of the appendix, and an agreement that the deadline for public questions would be extended. Without flagging it up, the council changed the time of the deadline, and then tried to block Mrs Angry's probing questions, funnily enough: even though the time of the deadline is expressly defined in the council's constitution.

Update: having pointed out that a refusal of Mrs Angry's questions would be in breach of the constitution, the council has now backed down and graciously 'allowed' her questions.

Hmm, well: on reading the contents of the withheld appendix, it is not difficult to see why the council was so reluctant to put it in the public domain: do read on - Mrs Angry's observations in red:

Issues arising from the failure of the library management system

Cause of the problem

1. On the 3rd March, the Vubis library management system failed, and has been
unavailable since. Emergency backup systems are in place for critical library
functions (issue & return of books) and use of self-service kiosks. Wifi
services and access to public PCs, printers and other equipment have since
been restored.

Note the pointing of the finger at Vubis, rather than Capita ... the emergency backups are strictly limited in scope, revenue is being lost because the data that generates the records of fines etc is inaccessible, and possibly irretrievable, and no: use of adult PCs has been restored, but not of those for children.

2. The incident occurred due to a combination of server and system errors. On
2nd March, Infor (the third party support provider for the Vubis application), 

(as far as Mrs Angry understands it, Vubis and Infor are the same thing ...)

... reported to LBB Libraries that the library system was running out of space on
the server. Customer Support Group (CSG) (Capita) responded to provide additional
physical storage. At this time, it was unknown that back-ups for the system
had been failing since the end of December 2015 (unrelated to the storage
issue). 

Why was it 'unknown'? How could they not have been aware, if they were properly monitoring the system?

The automated messages from Vubis alerting a nominated user of back-up failures were not being received. 

Why not?

Investigations to understand why these were not received are hampered due to the corrupted database.

Gosh, that is an unfortunate coincidence: but then ... how do you know the automated messages were not being received?

Consequently, the back-up failed again, causing the system to crash and corrupt.

What do you mean, 'again'? Didn't you spot it the first time?

3. When the server was rebooted, it began to corrupt the data on the system.

Why did you reboot, without checking that a consequence could be data corruption?

Whilst local backup processes were put in place these were backups to the
local machine which also corrupted. The root cause analysis (RCA) has been
concluded to be as follows:

Thought you just said investigations are hampered by ... 'corruption'? 

So how can you carry out a root cause analysis?

4. A number of disk drives on the server displayed hardware failures. These
were replaced and the system was left overnight to rebuild. 

Why was the system 'left' to rebuild? Who was responsible for overseeing this process?

This is a standard system administrative function to resolve a failed disk. 

Why risk more than one disk at a time?

Subsequently the server crashed around 03.54 on 3 March and it is believed that the database files on Vubis became corrupted as a result of, or during, the subsequent required reboots.

'It is believed'. So you really don't know.

5. A local backup process was put in place where data was backed up daily to
the Vubis server as part of the system functionality. According to an
investigation from the application support provider (Infor) these local back-ups
had started failing from 26 December 2015. System alerts were not received
reporting this failure. 

Here we have the crux of the matter. When did the local backup process start, and why did Capita not ensure the backups were working effectively? If proper monitoring had been in place, failures from 26th December would have been evident.

Investigations to understand why these were not received are hampered due to the corrupted database.

How convenient.

6. The pilot technology enabled opening (TEO) at Edgware library is unavailable
as the entry system user verification feature requires a check between the
card, the PIN and the Vubis database.

Which demonstrates even more clearly how irresponsible it would be to adopt the TEO ie unstaffed library system, which utterly depends on a reliable IT systems provider.

7. A non-corrupted tape back-up from March 2014 is available – this is the last
date a tape back-up was carried out as the server was changed to digital backups
only following this date.

So are you saying everything since March 2014 has been lost? All the data? Why was the server changed to digital backups at that time, with no system in place for checking that the backups actually worked?

8. Work is underway with the 3rd party support provider, Infor, to recover data from the corrupted system with the target date for completion by 31st March 2016. The agreed approach is to add this recovered data to the restored 2014 back-up and supplement with data held from adjoining systems and manual records where available.

Which implies all data from 2014 was lost: a catastrophic occurrence, and apparently unprecedented in library system management.

9. The Vubis system consists of different data types such as book, barcode,
borrower and transaction data which are in various conditions for recovery.
However, borrower data is recoverable, as is some of book information.

Some book information is recoverable: how much? Anything at all from the last two years? Fairly crucial information, for a library service, much of whose stock will have been processed in that period.



Populating the data gaps, old style. (One for the ladies of Broken Barnet ...)

The effects of the problem

10.Customers can borrow and return books in libraries. 

True. It's just you don't know which 'customers' - (which is an odd term to use for a service which by law is unable to charge anyone) - which 'customers' have borrowed or returned books acquired in the last two years, as those books presumably do not exist, now that the system has crashed ...

Wifi in libraries has been restored. However, renewals are not currently possible due to inadequate transaction data in the system, and fines are currently being waived. 

No: fines are not being 'waived'. They cannot be calculated, as the system, must we say it again, may have forgotten those books were issued, and in many cases, it would seem, that the books exist at all.

The current library catalogue is unavailable. PC access for Adults is available, but
not for children as there is no way of validating parental consent via an online
tool. 

Not what you say above. Anyway: the children of Broken Barnet might as well get used to being denied access to PCs, as when your cuts are in place, they won't be allowed into any unstaffed libraries, and may not be able to get to any library at all.

Manual workarounds continue to be investigated and implemented.

Marvellous. How amusing that we pay untold millions of pounds a year to the market leader in IT provision, only for them to take us back to the dark ages, pre-technology. We look forward to the reintroduction of Browne Issue cardboard tickets, though sadly without the horn-rimmed spectacle wearing lady librarians to oversee them, in an unstaffed library.

11.The extended opening hours at Edgware library are suspended as the entry
system requires a check with the Vubis database (see above).

And here we have the only demonstration necessary of why these unstaffed, virtually managed pretend 'libraries' are so dangerous: whose risks were not even presented to members until after they had passed initial approval of the idea. Removing the human element of public service may appeal to this current council administration, but it is fraught with risk - including, as we now read, the risk of incurring further costs in the event of failure.

How it will be resolved

12.We have recovered all of the information that is possible to recover from the
system that is not corrupted. 

How much is that then? The corrupted bit? Will you be able to recover, say, everything except the part that explains why no one apparently realised the system was failing?

Infor and the CSG teams are working together to make the system available again by the target date of 31st March 2016. A workaround has been created with the TEO supplier to break the link with the library system while the latter is repaired. Available IT services inside the TEO library will match that of staffed libraries during the unavailability of Vubis. This means that the library catalogue, renewal of books, reservations, some ebooks and e-audio books, and access to PCs for children and teenagers (due to parental consent being stored within Vubis) are unavailable at present. Manual workarounds continue to be investigated and implemented and notified to users as they become available.

Hmm. Manual workarounds only work, don't they, if you have staff to operate them?

13.Once the Library Management System (Vubis) is restored, it is estimated that it
will take the libraries service 3-6 months to fully populate the gaps in the data.
In the meantime, libraries will be open and services will be restored as the data
gaps are populated. The extended opening hours at Edgware will be able to be
available during this process (see below).

'To fully populate the gaps' ... mangled corporate claptrap as usual obscuring the truth: you can't 'populate' a loss like this, only (cover it up and) start again.

14.TEO requires names and PIN numbers to be able to operate. Verification
between the door entry panel and the library management system is not
available as the latter has failed. Entry into the building using TEO is therefore
not currently possible.

Yes: we know. Technology Enabled has become Technology Well and Truly F*cked, and a perfect metaphor for the process of outsourcing.

15.Names have been recovered but PIN information is irrecoverable. PINs will
need to be reissued. A step by step process to re-establishing the service,
based on the time required to communicate to all registered users of the TEO
service, has been created. Registered TEO users will be notified of the new
PIN by the 1st April, ready for the target date for re-opening of the TEO hours
of the 1st April.

One might be forgiven for wondering how Crapita will 'communicate' with registered users whose details died in the crash and burn of the Crapita enabled management system. 

By the power of telepathy? 

Semaphore signals by CEO Mr Andrew Travers, from the roof of North London Business Park?

But how lovely, that the re-opening will be on April Fools' Day ...

How it could be prevented from happening again

16.Since 6th March, new infrastructure has been built with increased physical
resilience in place to back up the system to a secure offsite backup service. 

Ah. Increased physical resilience. Hmm. Mrs Angry's spies have informed her that men with spanners and hopeless expressions have been seen wandering about the location of Barnet's library servers. 

Doesn't bode well, does it? 

Does anyone have a f*cking clue what they are doing?

A similar issue could arise only if the server, software and secure back up service were all compromised. While not impossible, this would be an extremely unlikely scenario.

Ho ho ho.

An extra layer of protection has been added in now having off-site back-ups. This means that the impact of any future outage would be downtime of hours rather than weeks.

You mean ... centralised to a Capita server? Are we paying more for this, by the way? And can we have our data back without any problems when the contract ends?

Contingency measures in the event of a similar incident/complete outage of database/technology

17.In the event of a future whole system data failure, a core library service at Core
and Core Plus libraries would be maintained through the deployment of
additional staff at an estimated cost of £75k per month. 

Whoa: £75K a month? Who pays?

This would be a mix of temporary agency staff and security staff with extra hours and overtime for permanent staff. It is assumed that it would take 1-3 weeks to secure the
services of, and train, additional staff.

So you suggest it is better to sack 46% of library staff, and then splurge our money on security and agency staff to run our library service. Beyond belief. And who pays for your cockups?

18.If the system were to fail again while customers were in a core library service at Core and Core Plus libraries, this would not affect a customer’s ability to leave the building. TEO works on entry only – to exit there is a door push button that is independent of the TEO system which would still operate. There are also push-bar emergency exits and if the alarms are activated or there is a power failure the doors default to open.

Who wants to test being stuck in a locked, unstaffed library at night, with a raging fire at your back, and an assurance that the doors will open? Perhaps we could volunteer Cllrs Reuben Thompstone, and Richard Cornelius? 



It says here, should a fire break out in your new 'technology enabled library', please use the extinguishers provided, and then run like f*ck, and hope the doors let you out ... 

19.The core library service would operate from 9 to 5 over six days at Core Plus
libraries and five days at Core libraries. The contingency plan would be
implemented in line with the following timetable:

Week 1: maintain advertised staffed and volunteer opening hours in Core Plus and Core libraries

Week 2: offer 9-5 opening in Core Plus libraries (and maintain advertised staffed and volunteer opening hours in Core libraries) through the deployment of security guard/agency staff for hours outside of staffed/volunteer hours

Week 3: offer 9-5 opening in Core Plus and Core libraries through the
deployment of security guard/agency staff for hours outside of
staffed/volunteer hours

The extent of detailed planning here rather suggests, does it not, readers, residents and taxpayers of Broken Barnet, that it is considered likely such a problem WILL happen again. 

Let's return to the section subheading: How it could be prevented from happening again

Here is Mrs Angry's suggestion, Tory councillors - and Capita has now given you the ammunition to do this:

Throw the Capita contract in the bin, stop the monstrous level of payments to private contractors, and consultants, and agency staff, and retain the in house service, with investment in libraries that would prevent further deterioration of the service, and support the increasing reliance of less advantaged residents on the resources offered by a professional library service.

This is not just about an IT failure, or one problem of one part of the massive contractual mess you have created by so easily approving the agreement in the first place. This could happen in any service, and the consequences of further data loss will be beyond repair, and hugely expensive. 

This is proof of the risk you have recklessly undertaken, with our money, and our local services. What can go wrong, will go wrong: and here we are, and yes, we told you so, and now the only thing you can do is ... call a halt, press the 'door push button', pray that it works - and get out while you can.

3 comments:

Anonymous said...

Standard audit procedures for backups only ever test that they are being taken. In 7+ years of managing IT systems that are backed up, I've never been asked to show that a test restore of the system has been carried out.

If you're interested, you might like to ask for the schedule of test restores of the system that were done to ensure that the backups were successful. If you never do a restore, all you have is hope, not a backup system.

At the very least, you'll make some auditors squirm a little.
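For anyone wondering what a 'test restore' involves in practice, here is a minimal sketch in Python. It is an illustration only, not a description of how the Vubis backups work or worked: the function names (make_backup, restore_problems) and the choice of a tar archive with SHA-256 checksums are assumptions made for the example. The idea is simply that a backup is only trusted once it has been restored somewhere safe and compared against what it was supposed to contain.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def make_backup(source_dir: Path, archive_path: Path) -> dict[str, str]:
    """Archive source_dir (hypothetical helper) and return a manifest
    mapping each relative file path to its SHA-256 digest."""
    manifest = {
        str(p.relative_to(source_dir)): sha256_of(p)
        for p in sorted(source_dir.rglob("*"))
        if p.is_file()
    }
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname=".")
    return manifest


def restore_problems(archive_path: Path, manifest: dict[str, str]) -> list[str]:
    """Restore the archive into a scratch directory and list every file
    from the manifest that is missing or whose contents differ."""
    problems = []
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive_path, "r:gz") as tar:
            tar.extractall(scratch)
        for rel, digest in manifest.items():
            restored = Path(scratch) / rel
            if not restored.is_file() or sha256_of(restored) != digest:
                problems.append(rel)
    return problems
```

An empty list from restore_problems means the restore matched the manifest; anything else is a backup that exists on paper only. Run on a schedule, a check of this kind would produce the 'schedule of test restores' mentioned above.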

Anonymous said...

Oh yes and:

"4. A number of disk drives on the server displayed hardware failures. These
were replaced and the system was left overnight to rebuild."

Server disks very very rarely just 'fail'. Any competent setup would have the disk in a RAID array, which would ensure that data would survive a failed disk (or number of failed disks), and that the disk could be replaced. Failing server disks often produce myriads of reports that there is an issue before they fall over.

"reported to LBB Libraries that the library system was running out of space on the server. Customer Support Group (CSG) (Capita) responded to provide additional physical storage."

All server systems that I've worked with have always been set up to give plenty of warnings prior to disk space issues (generally, anything under 25% of free disk is a cause for concern).
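As a rough illustration of that sort of warning, a free-space check can be very small indeed. This is a sketch only, using Python's standard shutil.disk_usage; the function names and the default 25% threshold are taken from the rule of thumb above rather than from anything known about the actual library server.

```python
import shutil


def free_fraction(path: str) -> float:
    """Return the fraction of the disk holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def needs_attention(free: float, threshold: float = 0.25) -> bool:
    """True when the free fraction has dropped below the warning threshold
    (the 25% default is an assumed rule of thumb, not a standard)."""
    return free < threshold


if __name__ == "__main__":
    fraction = free_fraction(".")
    if needs_attention(fraction):
        print(f"WARNING: only {fraction:.0%} of disk space free")
    else:
        print(f"OK: {fraction:.0%} of disk space free")
```

The check itself is trivial; what matters is that somebody is on the receiving end of the warning and acts on it.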

"Why risk more than one disk at a time?"

It may well be that, dependent on the setup of the system, leaving in corrupted disks may have prevented it from booting at all.

Sadly, I've not time to dig through this more at the second, but I may do later.

Mrs Angry said...

Thank you for your very interesting & useful comments Anon, regarding certain IT issues: am delaying publication slightly, so as to allow me to use the information most effectively ...