- Duration of the outage: ~8 hours, 6:00am CST - 2:00pm CST
- Impact: 100% of users were affected. The entire search application and its integration from chicagoresourcehub.com were down for the full duration, so users could not search, which is the application's core purpose.
- I had disabled an option, which locked out the Google API and any user from selecting (i.e. querying) the data and working with it, for example presenting results to the user who made the request.
- At approximately 1:00pm CST, a customer noticed the service was down and reported it through the contact form on the website.
- When I received the notice, I began helping the user troubleshoot to see whether the problem was on the user's side. In the past, Caspio's web application would go down for brief periods during the day, and Caspio's service was far more limited than Google's and would go down from overuse. So I did not immediately assume that the problem was something that I had created and could fix.
- This initial assumption misled the early investigation and was part of my own contribution to the problem. I had incorrectly assumed that either the customer did not understand the application properly or a connected service was down. As a result, my misguided first response was to teach the user how to use the application. With more experience, I now realize that even if the customer had not understood the application, that would still have been my problem, because it would mean I had not built an application that users can easily understand. I have also realized how many mistakes I can make as a developer, and I have learned to be much more open to looking for problems that I created or did not properly handle in my development.
- The incident did not escalate any further than myself and the customers involved.
- I eventually resolved the issue when I realized that I had disabled an option in the Google Fusion Tables settings that simply needed to be re-enabled.
Root cause and resolution:
- Fortunately, although this issue had a devastating impact by shutting down my application, it was extremely easy to resolve. Specifically, the issue occurred in the Google Fusion table itself, which had become my new database management tool. In this service, I had disabled this checkbox option: Reuse access: ■ Allow downloads, so that nobody would improperly use the thousands of records that I had been maintaining. As a result, no party other than me had access to download the data. I did not recognize the error because I, and anyone else on the internet, could still view the data in the Google Fusion table itself, since it was an open database with no viewing restrictions; people just could not download the data with Google’s services. It turned out that the Google API I was using, and the server that the query for the data came from, also lost access to the information in the database, because they reach it through downloads. At the time, I did not realize that when the web server hosting my web application queries the data through the API, that query needs to be able to download the data. In my inexperience, I assumed that a computer would interact with the database the same way a human would, so that with no viewing restrictions a query could be made without restrictions.
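The gap between what a person could see and what the API could fetch can be illustrated with a small toy model. This is only a sketch of the behavior described above, not Google's actual permission logic: the class name, field names, and both functions are hypothetical, and the rule that an API query is treated as a download is an assumption drawn from this incident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableSettings:
    """Hypothetical stand-in for the two sharing options discussed above."""
    publicly_viewable: bool  # anyone can open the table in a browser
    allow_downloads: bool    # the "Allow downloads" checkbox


def can_view_in_browser(settings: TableSettings, is_owner: bool) -> bool:
    # Viewing only depends on the table being open (or on being the owner).
    return is_owner or settings.publicly_viewable


def can_query_via_api(settings: TableSettings, is_owner: bool) -> bool:
    # Assumption from the incident: an API query counts as a download,
    # so a non-owner needs both view access and download permission.
    return is_owner or (settings.publicly_viewable and settings.allow_downloads)


during_outage = TableSettings(publicly_viewable=True, allow_downloads=False)
after_fix = TableSettings(publicly_viewable=True, allow_downloads=True)

# A visitor can still see the table, which is why the problem was easy to miss.
print(can_view_in_browser(during_outage, is_owner=False))  # True
# The web server's query, made as a non-owner, is refused.
print(can_query_via_api(during_outage, is_owner=False))    # False
# Re-enabling downloads restores the query path.
print(can_query_via_api(after_fix, is_owner=False))        # True
```

Under this model, checking the table in a browser could never have revealed the outage; only a query made the way the web server makes it would have.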
Corrective and preventative measures:
- Since my switch to Google Fusion, this has been the only time my data has been down, besides a few 15-minute blocks of scheduled maintenance. For the most part, Google Fusion has turned out to be a much better solution than Caspio's Web Application. Because this error was so minor, I have done nothing beyond enabling downloads to further prevent this issue.
- This experience, along with the other steps I have taken to manage this web application, has taught me a lot about how to prevent downtime and improve the user experience. I am now much more careful in how I help others troubleshoot, because I realize that any problem a user has is also my problem. Whatever the reason a user cannot understand or access my software, I have a role in improving the user experience. Although this incident may not have helped me specifically in this regard, managing chicagoresourcehub.com has also taught me that my software should provide solutions that are scalable, easy to manage, easy to update, easy to integrate across services and platforms, and easy for customers to work with.