Manufacturing errors are often masked by symptoms that lead us away from the root cause. I recently had an experience that sheds some light on this idea, and I would like to share it with you. It involved one of our RealTime Production Monitoring devices and its connection to the Local Area Network (LAN).
The problem started when an application called RTServer needed to be restarted every week. RTServer receives and processes production data from work centers, which is how manufacturers can monitor what equipment is running and whether it is running efficiently or not. Obviously, this is critical information and the fact that it reports in real time makes it invaluable.
With the economy being what it is, some companies will shut down production on Friday night and restart again on Monday. In addition to saving labor, most companies will turn off power to save energy. Of course, some equipment is critical and has to keep running even when manufacturing ceases.
The hardware behind RTServer is an example of such equipment. RTServer receives data from a device called a “Machine Monitoring Unit” or MMU. This unit is connected to the LAN and requires a constant power supply. If the power is off, the MMU can no longer pass on data packets. To get the MMU sending again on Monday morning, the manufacturer had to stop RTServer and then restart the application.
To stop this problem from recurring every Monday morning, the customer called the IQMS technical support team and requested help: “RTServer is not working, we have to restart it every Monday morning, what do we do?” We connect to their network and try to diagnose the problem by checking whether the MMU is on and can communicate with the RTServer computer. It works and communicates just fine every time we connect.
Our very first instinct says, “It has got to be a network problem,” to which the customer says, “Our network consultant says the network has been checked out and it’s fine.” OK, so where do we go from here? (Oh, and by the way, there are no errors showing up in the event logs or anywhere else.)
Our next question is, “Is it possible that the power to the MMU is being turned off when shutting down on Friday night?” to which their answer is, “No way, we only cut power to the machines and equipment in the production area.” This sort of troubleshooting goes on for quite some time, back and forth: it’s the hardware, no, it has to be the software, no, it’s the network!
We send our client a new MMU, we send them a new computer to run RTServer on, we even connect and monitor when they start the machines to see if there are any clues. I mean, we do everything EXCEPT insist they check whether the power is being turned off on Friday night.
Finally, we ask IT to set up Enco Ping Monitor, which repeatedly pings devices on the network and records when each one stops responding. From this, we could see that the MMU went offline every Friday night at midnight and came back on Monday morning. WOW! I thought we asked if the power was being turned off!
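For readers curious about what that kind of monitoring boils down to, here is a minimal sketch in Python. It is not how the ping monitor we used works internally; it only illustrates the idea of turning a stream of "did the device answer?" checks into outage windows. The function name `find_outages`, the hourly sampling, and the sample timestamps (including the 06:00 Monday return) are all made up for illustration.

```python
from datetime import datetime, timedelta


def find_outages(samples):
    """Collapse time-ordered (timestamp, reachable) samples into outage windows.

    Returns a list of (went_offline, came_back) pairs. came_back is None
    when the device is still unreachable at the last sample.
    """
    outages = []
    offline_since = None
    for timestamp, reachable in samples:
        if not reachable and offline_since is None:
            # First failed check after a good one: an outage begins.
            offline_since = timestamp
        elif reachable and offline_since is not None:
            # First good check after a failure: the outage ends.
            outages.append((offline_since, timestamp))
            offline_since = None
    if offline_since is not None:
        outages.append((offline_since, None))
    return outages


# Hypothetical data: one check per hour from Friday 22:00 to Monday 08:00.
# The device stops answering at midnight and answers again at 06:00 Monday.
start = datetime(2024, 1, 5, 22, 0)        # a Friday
power_off = datetime(2024, 1, 6, 0, 0)     # Friday night at midnight
power_on = datetime(2024, 1, 8, 6, 0)      # Monday morning (assumed time)

samples = []
for hour in range(59):
    moment = start + timedelta(hours=hour)
    samples.append((moment, moment < power_off or moment >= power_on))

outages = find_outages(samples)
for went_offline, came_back in outages:
    print("offline from", went_offline.isoformat(), "until", came_back.isoformat())
```

In practice the `reachable` flag would come from a real probe, such as a ping or a connection attempt to the device. A single outage window that lines up with the plant's closing schedule every week points straight at power rather than at software or the network.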
To conclude this story: as the final closing task on Friday night, the last person to lock up the building would turn off the light switch, which happened to control the receptacle that the MMU was plugged into. Hours of frustration and troubleshooting on both sides could have been avoided if only the simple answer had been considered first. It was an important lesson for our customer. On our end, we made sure our documentation specified that all RTServer and MMU hardware must be connected to a non-switched power source, so that hopefully this problem won’t occur again.
What seemingly complex problems have you encountered that ended up having simple solutions?