[00:27:36] 06Operations, 10DBA, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#2998926 (10Paladox) [01:06:16] PROBLEM - MD RAID on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:06:16] PROBLEM - configured eth on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:06:26] PROBLEM - Check whether ferm is active by checking the default input chain on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:06:52] ostriches ^^ [01:07:03] gerrit seems down to me [01:07:06] RECOVERY - MD RAID on cobalt is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [01:07:07] RECOVERY - configured eth on cobalt is OK: OK - interfaces up [01:07:09] back up [01:07:16] RECOVERY - Check whether ferm is active by checking the default input chain on cobalt is OK: OK ferm input default policy is set [01:09:33] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit went down for 1minute on 05/01/17 - https://phabricator.wikimedia.org/T157203#2998958 (10Paladox) [01:10:16] PROBLEM - dhclient process on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:10:16] PROBLEM - salt-minion processes on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:10:16] PROBLEM - Check systemd state on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:10:16] PROBLEM - DPKG on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:10:16] PROBLEM - MD RAID on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:10:54] Any ops in here for ^^ [01:11:06] RECOVERY - dhclient process on cobalt is OK: PROCS OK: 0 processes with command name dhclient [01:11:06] RECOVERY - salt-minion processes on cobalt is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:11:06] RECOVERY - Check systemd state on cobalt is OK: OK - running: The system is fully operational [01:11:06] RECOVERY - DPKG on cobalt is OK: All packages OK [01:11:06] RECOVERY - MD RAID on cobalt is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [01:11:38] paladox: hi [01:11:48] hi [01:12:19] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit went down for 1minute on 05/02/17 - https://phabricator.wikimedia.org/T157203#2998985 (10Paladox) [01:14:15] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit went down for 1minute on 05/02/17 - https://phabricator.wikimedia.org/T157203#2998988 (10JustBerry) p:05Triage>03High Very important (if not unbreak now). [01:15:46] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit went down for 1minute on 05/02/17 - https://phabricator.wikimedia.org/T157203#2998990 (10Paladox) p:05High>03Triage [01:16:00] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit went down for 1minute on 05/02/17 - https://phabricator.wikimedia.org/T157203#2998958 (10Paladox) p:05Triage>03High [01:16:06] PROBLEM - puppet last run on elastic1046 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [01:17:35] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit went down for 1minute on 05/02/17 - https://phabricator.wikimedia.org/T157203#2998958 (10Paladox) [01:19:16] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit went down for 1minute on 05/02/17 - https://phabricator.wikimedia.org/T157203#2998994 (10JustBerry) Investigate channel logs in #wikimedia-operations as necessary: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20170205.txt [01:22:31] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit: Investigate why gerrit went down for 1minute on 05/02/17 and then again 4 minute later - https://phabricator.wikimedia.org/T157203#2998996 (10Paladox) [01:31:26] PROBLEM - puppet last run on francium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:35:06] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 1809.945646 Seconds [01:36:06] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 43.706968 Seconds [01:37:51] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit: Investigate why gerrit went down for 1minute on 05/02/17 and then again 4 minute later - https://phabricator.wikimedia.org/T157203#2999000 (10Paladox) p:05High>03Unbreak! Setting unbreak as it appears loading gerrit changes is taking longer then... [01:39:26] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=80%) [01:44:26] PROBLEM - puppet last run on db1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:45:06] RECOVERY - puppet last run on elastic1046 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [01:59:26] RECOVERY - puppet last run on francium is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [02:09:56] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit: Investigate why gerrit went down for 1minute on 05/02/17 and then again 4 minute later - https://phabricator.wikimedia.org/T157203#2999010 (10Paladox) [02:10:47] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit: Investigate why gerrit went down for 1minute on 05/02/17 and then again 4 minute later - https://phabricator.wikimedia.org/T157203#2998958 (10Paladox) [02:11:26] PROBLEM - Disk space on elastic1026 is CRITICAL: DISK CRITICAL - free space: / 2219 MB (8% inode=90%) [02:12:26] RECOVERY - puppet last run on db1029 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [02:21:08] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.10) (duration: 08m 20s) [02:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:26:26] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Feb 5 02:26:26 UTC 2017 (duration 5m 18s) [02:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:33:26] PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100% [02:33:46] RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.14 ms [02:37:43] ladies and gentlemen, ORES is overloaded: https://grafana.wikimedia.org/dashboard/db/ores?panelId=9&fullscreen [02:39:04] any ops around? [02:39:18] Amir1: gerrit issue? 
[02:39:54] nope, ORES is overloaded I need to increase number of workers to reduce the load [02:42:56] 06Operations, 10ORES, 06Revision-Scoring-As-A-Service, 10Revision-Scoring-As-A-Service-Backlog: ORES Overloaded (particularly 02/05/17 2:25-2:30) - https://phabricator.wikimedia.org/T157206#2999062 (10Ladsgroup) p:05Triage>03Unbreak! [02:43:18] Amir1: done ^^ [02:43:39] JustBerry: Thanks [02:44:16] I make a patch for now, I need Ops to review it [02:45:26] 06Operations, 10ORES, 06Revision-Scoring-As-A-Service, 10Revision-Scoring-As-A-Service-Backlog: ORES Overloaded (particularly 02/05/17 2:25-2:30) - https://phabricator.wikimedia.org/T157206#2999065 (10JustBerry) Number of workers needs to be increased to reduce the load. Temporary patch being created by... [02:45:26] Amir1: logged [02:45:41] thanks [02:47:04] Amir1: which ops (from https://phabricator.wikimedia.org/project/members/1306/) [02:48:37] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit: Investigate why gerrit went down for 1minute on 05/02/17 (dd/mm/yy) and then again 4 minute later - https://phabricator.wikimedia.org/T157203#2999067 (10JustBerry) [02:49:41] (03PS1) 10Ladsgroup: ores: Increase capacity [puppet] - 10https://gerrit.wikimedia.org/r/336048 (https://phabricator.wikimedia.org/T157206) [02:50:03] anyone who can +2 in operations/puppet [02:50:10] so Operations team in WMF [02:51:22] I'm so sorry... [02:51:46] Amir1: what? [02:51:50] robh: moritzm _joe_ bblack [02:52:02] (for mass pining) [02:52:32] Amir1: they're all 9-10 hours+ idle [02:53:05] Amir1: you only changed 40->45 [02:53:07] ? [02:53:26] gehel: madhuvishy mark marostegui [02:53:41] JustBerry: yup, that would work for now [02:53:51] If it didn't work, I'll email ops-l [02:54:28] (03CR) 10JustBerry: [C: 031] "Looks fine. Incremented worker count from 40 to 45 per https://phabricator.wikimedia.org/T157206." [puppet] - 10https://gerrit.wikimedia.org/r/336048 (https://phabricator.wikimedia.org/T157206) (owner: 10Ladsgroup) [02:54:37] Amir1: if that helps ^^ [02:54:52] Thanks! [02:59:33] Sent an email to ops-l, hopefully they get back to us soonish [02:59:48] Amir1: sounds good [03:00:34] 06Operations, 10ORES, 06Revision-Scoring-As-A-Service, 10Revision-Scoring-As-A-Service-Backlog, 13Patch-For-Review: ORES Overloaded (particularly 02/05/17 2:25-2:30) - https://phabricator.wikimedia.org/T157206#2999074 (10JustBerry) Email sent to ops. Awaiting patch review from ops. [03:18:04] It is still sending out ~100 503s / min [03:18:29] Amir1: was the patch approved [03:18:37] not yet [03:18:41] heh [03:19:21] hellp [03:19:23] hello [03:19:26] what's up [03:19:32] hey madhuvishy [03:19:40] https://grafana.wikimedia.org/dashboard/db/ores?panelId=9&fullscreen&from=now-3h&to=now [03:19:54] I think it's a chinese indexing bot [03:20:08] Amir1: I was going to head out to dinner, and saw email [03:20:08] (03PS1) 10Tim Landscheidt: Wait for the Kubernetes pod to shut down after "stop" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/336049 (https://phabricator.wikimedia.org/T156626) [03:20:09] we might issue a varnish ban but I need to dig into hadoop to be sure [03:20:15] aah [03:20:26] Amir1: you don't have access? 
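Before any cache-layer action against the suspected indexing bot mentioned above, the first step is usually to confirm which client is actually responsible. A minimal sketch of a quick look from a cache frontend serving ores.wikimedia.org, assuming Varnish 4 tooling; the instance name, URL pattern and sample size are illustrative, and the authoritative source remains the webrequest data in Hadoop that Amir1 refers to:

    # sample recent ORES requests on a cache frontend and count them by user agent
    sudo varnishncsa -n frontend -q 'ReqURL ~ "scores"' -F '%{User-agent}i' \
        | head -n 5000 | sort | uniq -c | sort -rn | head

A Varnish "ban" only invalidates cached objects matching an expression; actually refusing a misbehaving client would normally be done as a VCL change deployed through puppet rather than as a ban.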
[03:20:48] madhuvishy: I have [03:20:53] I'm on it [03:21:21] (03CR) 10jerkins-bot: [V: 04-1] Wait for the Kubernetes pod to shut down after "stop" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/336049 (https://phabricator.wikimedia.org/T156626) (owner: 10Tim Landscheidt) [03:21:21] madhuvishy: but in the mean time can we have this merged? https://gerrit.wikimedia.org/r/336048 [03:21:34] Amir1: sure looking [03:21:41] madhuvishy: Thanks [03:21:56] Amir1: is that enough of an increase? [03:22:14] yeah, It reduces the number for now [03:22:30] Amir1: okay - merging, looks fairly harmless [03:22:33] (I need to manually restart workers but not a big thing) [03:22:40] (03CR) 10Madhuvishy: [C: 032] ores: Increase capacity [puppet] - 10https://gerrit.wikimedia.org/r/336048 (https://phabricator.wikimedia.org/T157206) (owner: 10Ladsgroup) [03:23:20] Amir1: how does this look btw https://wikitech.wikimedia.org/w/index.php?title=Template%3ATools_Access_Request&type=revision&diff=1457533&oldid=1002467 [03:23:32] madhuvishy: thanks [03:23:41] Amir1: okay all merged [03:24:06] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 674.97 seconds [03:24:12] madhuvishy: thanks [03:24:22] JustBerry: Nice [03:24:22] 06Operations, 10ORES, 06Revision-Scoring-As-A-Service, 10Revision-Scoring-As-A-Service-Backlog, 13Patch-For-Review: ORES Overloaded (particularly 02/05/17 2:25-2:30) - https://phabricator.wikimedia.org/T157206#2999095 (10JustBerry) +2 and merged by @madhuvishy. [03:24:23] (03PS2) 10Tim Landscheidt: Wait for the Kubernetes pod to shut down after "stop" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/336049 (https://phabricator.wikimedia.org/T156626) [03:25:29] (03CR) 10jerkins-bot: [V: 04-1] Wait for the Kubernetes pod to shut down after "stop" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/336049 (https://phabricator.wikimedia.org/T156626) (owner: 10Tim Landscheidt) [03:27:02] (03CR) 10JustBerry: [C: 04-1] "debian-glue still failing. Will re-review again after another patch is implemented." [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/336049 (https://phabricator.wikimedia.org/T156626) (owner: 10Tim Landscheidt) [03:27:06] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 297.97 seconds [03:27:25] Amir1: ^^ ? [03:27:31] Amir1: seems calmer? [03:27:33] not related [03:27:41] madhuvishy: I'm restarting [03:28:53] !log ladsgroup@scb100[1-4]:~$ sudo service celery-ores-worker restart (T157206) [03:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:28:59] T157206: ORES Overloaded (particularly 02/05/17 2:25-2:30) - https://phabricator.wikimedia.org/T157206 [03:33:16] PROBLEM - puppet last run on mw1286 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz] [03:33:54] Amir1: looks better... still monitoring. [03:35:25] The total average in the past hour is down to 100 per minute which is better than 160 half an hour ago [03:36:27] madhuvishy: it's much calmer now [03:36:28] thanks [03:37:10] Amir1: patch-for-review can be removed now? 
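The manual worker restart Amir1 mentions above is the step logged a few minutes later as "sudo service celery-ores-worker restart" on scb100[1-4]. A rough sketch of that rolling restart, assuming direct SSH access to the four scb hosts; in practice this would normally go through the standard orchestration tooling rather than a hand-rolled loop:

    # run puppet so the new worker count is applied, restart celery on each node,
    # then report how many celery processes came back
    for h in scb1001 scb1002 scb1003 scb1004; do
        echo "== $h =="
        ssh "$h.eqiad.wmnet" 'sudo puppet agent -t >/dev/null 2>&1; sudo service celery-ores-worker restart; sleep 5; pgrep -c -f celery'
    done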
[03:37:24] Amir1: errors popped up again [03:39:43] JustBerry: two nodes still on 40 workers [03:40:07] we have only ten more workers now, and given time to restart it is likely [03:41:48] (03PS3) 10Tim Landscheidt: Wait for the Kubernetes pod to shut down after "stop" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/336049 (https://phabricator.wikimedia.org/T156626) [03:42:36] PROBLEM - carbon-cache@c service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@c is failed [03:42:54] Amir1: gerrit or ores ^^ ? [03:43:06] PROBLEM - Check systemd state on graphite1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [03:43:45] neither, carbon :D [03:44:08] graphite, the system to log which empowers grafana [03:46:38] JustBerry: the WMF datacenter is rather big so it's very likely that nodes here and there and down sometimes (for various reasons) the important thing is that the distributed system is robust to move traffic to other nodes [03:47:10] until they get fixed (which happen automagically most of the time) [03:47:15] eh [03:48:41] Amir1: are you good? i'm about to leave [03:49:25] madhuvishy: yup, thanks [03:49:29] I'm monitoring [03:49:38] (03CR) 10JustBerry: "Removing -1 per successful test builds." [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/336049 (https://phabricator.wikimedia.org/T156626) (owner: 10Tim Landscheidt) [03:50:03] The average number of 503s is going down [03:50:40] okay, i'll have my laptop, so ping if anything [03:51:09] Thanks! [03:52:07] PROBLEM - puppet last run on lvs1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:53:51] This looks a little bit worrying. [03:54:14] but can't say for sure [03:54:31] (puppet failure is not a big deal usually though) [03:54:36] RECOVERY - carbon-cache@c service on graphite1003 is OK: OK - carbon-cache@c is active [03:55:06] RECOVERY - Check systemd state on graphite1003 is OK: OK - running: The system is fully operational [04:01:16] RECOVERY - puppet last run on mw1286 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [04:01:25] Memory also looks okay-ish [04:01:26] https://ganglia.wikimedia.org/latest/graph.php?r=4hr&z=xlarge&h=scb1001.eqiad.wmnet&m=cpu_report&s=by+name&mc=2&g=mem_report&c=Service+Cluster+B+eqiad [04:13:16] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:20:06] RECOVERY - puppet last run on lvs1001 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [04:29:06] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:42:16] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [04:58:06] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [05:15:36] PROBLEM - carbon-cache@e service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@e is failed [05:16:06] PROBLEM - Check systemd state on graphite1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [05:18:47] With the new capacity it can take it for now (I'd add three more workers per node just for sure but meh). 
I have to go afk but if there is another spike, I'll make another puppet change [05:20:06] PROBLEM - puppet last run on analytics1056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:24:36] RECOVERY - carbon-cache@e service on graphite1003 is OK: OK - carbon-cache@e is active [05:25:06] RECOVERY - Check systemd state on graphite1003 is OK: OK - running: The system is fully operational [05:25:17] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit: Investigate why gerrit went down for 1 minute on Feburary 5 and then again 4 minutes later - https://phabricator.wikimedia.org/T157203#2999157 (10Jay8g) [05:39:26] PROBLEM - puppet last run on scb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:49:07] RECOVERY - puppet last run on analytics1056 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [05:49:27] PROBLEM - puppet last run on prometheus2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:07:26] RECOVERY - puppet last run on scb1002 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:17:26] RECOVERY - puppet last run on prometheus2002 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [06:28:26] PROBLEM - Check systemd state on graphite2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:28:26] PROBLEM - carbon-cache@c service on graphite2002 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@c is failed [06:28:36] PROBLEM - carbon-cache@e service on graphite2002 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@e is failed [06:41:26] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK [06:45:36] PROBLEM - carbon-cache@f service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@f is failed [06:46:06] PROBLEM - Check systemd state on graphite1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
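The recurring carbon-cache@<instance> failures on graphite1003 and graphite2002 are what flip those hosts' systemd state to "degraded". A minimal triage sketch on the affected host using stock systemd tooling; the instance letter is whichever unit the alert names:

    systemctl status carbon-cache@c
    journalctl -u carbon-cache@c -n 50 --no-pager   # why the unit died
    sudo systemctl restart carbon-cache@c
    sudo systemctl reset-failed                     # clears the lingering "degraded" state
    systemctl is-system-running                     # should report "running" again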
[06:46:26] RECOVERY - Check systemd state on graphite2002 is OK: OK - running: The system is fully operational [06:46:26] RECOVERY - carbon-cache@c service on graphite2002 is OK: OK - carbon-cache@c is active [06:46:36] RECOVERY - carbon-cache@e service on graphite2002 is OK: OK - carbon-cache@e is active [06:47:06] PROBLEM - Disk space on elastic1022 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=90%) [06:54:36] RECOVERY - carbon-cache@f service on graphite1003 is OK: OK - carbon-cache@f is active [06:55:06] RECOVERY - Check systemd state on graphite1003 is OK: OK - running: The system is fully operational [06:55:45] (03PS1) 10Tim Landscheidt: Correct weekday in changelog entry [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336055 (https://phabricator.wikimedia.org/T156651) [06:55:47] (03PS1) 10Tim Landscheidt: Add extended description to control [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336056 (https://phabricator.wikimedia.org/T156651) [06:55:49] (03PS1) 10Tim Landscheidt: Generate man page for collector-runner [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336057 (https://phabricator.wikimedia.org/T156651) [07:02:07] (03CR) 10Tim Landscheidt: "recheck" [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336056 (https://phabricator.wikimedia.org/T156651) (owner: 10Tim Landscheidt) [07:04:55] (03CR) 10Tim Landscheidt: "recheck" [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336057 (https://phabricator.wikimedia.org/T156651) (owner: 10Tim Landscheidt) [07:10:06] (03PS2) 10Tim Landscheidt: Add extended description to control [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336056 (https://phabricator.wikimedia.org/T156651) [07:10:08] (03PS2) 10Tim Landscheidt: Generate man page for collector-runner [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336057 (https://phabricator.wikimedia.org/T156651) [07:15:07] PROBLEM - puppet last run on fluorine is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:20:06] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2297 [07:23:16] PROBLEM - carbon-cache@b service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@b is failed [07:24:16] RECOVERY - carbon-cache@b service on graphite1003 is OK: OK - carbon-cache@b is active [07:25:06] RECOVERY - check_mysql on frdb2001 is OK: Uptime: 484552 Threads: 1 Questions: 7077939 Slow queries: 2392 Opens: 4345 Flush tables: 1 Open tables: 574 Queries per second avg: 14.607 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [07:45:06] RECOVERY - puppet last run on fluorine is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [07:59:51] (03CR) 10Zhuyifei1999: [C: 031] "Logic looks sane" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/336049 (https://phabricator.wikimedia.org/T156626) (owner: 10Tim Landscheidt) [08:30:06] PROBLEM - Check systemd state on graphite1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:30:26] PROBLEM - puppet last run on mw1303 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:30:36] PROBLEM - carbon-cache@c service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@c is failed [08:38:56] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=3270.20 Read Requests/Sec=3719.60 Write Requests/Sec=2.60 KBytes Read/Sec=24619.60 KBytes_Written/Sec=985.60 [08:46:56] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=1.30 Read Requests/Sec=127.20 Write Requests/Sec=121.10 KBytes Read/Sec=1256.80 KBytes_Written/Sec=963.20 [08:54:06] RECOVERY - Check systemd state on graphite1003 is OK: OK - running: The system is fully operational [08:54:36] RECOVERY - carbon-cache@c service on graphite1003 is OK: OK - carbon-cache@c is active [08:58:26] 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why colbolt went down for 1 minute on Feburary 5 and then again 4 minutes later - https://phabricator.wikimedia.org/T157203#2999197 (10Peachey88) p:05Unbreak!>03Triage [08:58:42] 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why cobalt went down for 1 minute on Feburary 5 and then again 4 minutes later - https://phabricator.wikimedia.org/T157203#2998958 (10Peachey88) [08:59:26] RECOVERY - puppet last run on mw1303 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [09:00:43] 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why cobalt went down for 1 minute on Feburary 5 and then again 4 minutes later - https://phabricator.wikimedia.org/T157203#2998958 (10Peachey88) > Setting unbreak as it appears loading gerrit changes is taking longer then usual to load. Seems f... [09:18:14] (03CR) 10Volans: [C: 04-1] "One thing still missing in the puppet file, see inline." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680) (owner: 10Marostegui) [10:13:06] PROBLEM - Check systemd state on graphite1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:13:06] PROBLEM - carbon-cache@h service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@h is failed [10:24:06] RECOVERY - Check systemd state on graphite1003 is OK: OK - running: The system is fully operational [10:24:06] RECOVERY - carbon-cache@h service on graphite1003 is OK: OK - carbon-cache@h is active [10:45:29] ops around? [10:45:37] probably need to increase a little more [10:47:16] PROBLEM - puppet last run on install1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:15:16] RECOVERY - puppet last run on install1001 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [11:44:35] 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why cobalt went down for 1 minute on Feburary 5 and then again 4 minutes later - https://phabricator.wikimedia.org/T157203#2999349 (10Paladox) >>! In T157203#2999206, @Peachey88 wrote: >> Setting unbreak as it appears loading gerrit changes is t... [11:46:36] PROBLEM - carbon-cache@e service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@e is failed [11:47:06] PROBLEM - Check systemd state on graphite1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:54:36] RECOVERY - carbon-cache@e service on graphite1003 is OK: OK - carbon-cache@e is active [11:55:06] RECOVERY - Check systemd state on graphite1003 is OK: OK - running: The system is fully operational [11:57:54] 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why cobalt went down for 1 minute on Feburary 5 and then again 4 minutes later - https://phabricator.wikimedia.org/T157203#2999354 (10Paladox) Cobalt semed to be having higher then normal cpu levels, it's showing the system using a lot of cpu. (... [11:59:56] 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why cobalt went down for 1 minute on Feburary 5 and then again 4 minutes later - https://phabricator.wikimedia.org/T157203#2999355 (10Paladox) p:05Triage>03High Changing to high as needs investigation as soon as possible. @Peachey88 if you c... [12:03:26] PROBLEM - puppet last run on rcs1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:08:06] PROBLEM - puppet last run on elastic1024 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [12:11:06] PROBLEM - puppet last run on elastic1026 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [12:18:06] PROBLEM - puppet last run on elastic1040 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [12:32:26] RECOVERY - puppet last run on rcs1002 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [12:34:42] hello, could someone help me mount the dumps share on my nova instance? I made a ticket on phabricator https://phabricator.wikimedia.org/T156586 [12:37:06] PROBLEM - puppet last run on elastic1022 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [12:44:16] PROBLEM - puppet last run on cp1049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:13:16] RECOVERY - puppet last run on cp1049 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [13:30:06] PROBLEM - Check systemd state on graphite1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:30:36] PROBLEM - carbon-cache@c service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@c is failed [13:39:46] (03PS14) 10Marostegui: Reporting tests with the private data script [puppet] - 10https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680) [13:40:47] (03CR) 10Marostegui: Reporting tests with the private data script (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680) (owner: 10Marostegui) [13:54:36] RECOVERY - carbon-cache@c service on graphite1003 is OK: OK - carbon-cache@c is active [13:55:06] RECOVERY - Check systemd state on graphite1003 is OK: OK - running: The system is fully operational [13:56:16] PROBLEM - puppet last run on labvirt1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:58:36] RECOVERY - MariaDB Slave Lag: s5 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 89969.22 seconds [14:25:16] RECOVERY - puppet last run on labvirt1004 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [15:07:53] 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why cobalt went down for 1 minute on Feburary 5 and then again 4 minutes later - https://phabricator.wikimedia.org/T157203#2999445 (10Dzahn) > [01:06:17] PROBLEM - configured eth on cobalt is CRITICAL: CHECK_NRPE: Socket timeout aft... 
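On the cobalt checks Dzahn quotes just above: a CHECK_NRPE socket timeout means the monitoring server got no answer from the NRPE daemon within 10 seconds, which on a briefly overloaded host tends to hit several unrelated checks at once, exactly the pattern of the 01:06-01:11 alerts. A sketch of reproducing the checks by hand, assuming the standard nagios-nrpe tooling; the remote check name is whatever the local NRPE configuration defines:

    # from the monitoring host: re-run one check against cobalt with a longer timeout
    /usr/lib/nagios/plugins/check_nrpe -H cobalt.wikimedia.org -c check_raid -t 30

    # on cobalt itself: rough equivalents of the checks that timed out
    cat /proc/mdstat                         # MD RAID
    ip link show                             # configured eth
    sudo iptables -L INPUT -n | head -n 1    # ferm default input chain policy
    systemctl is-system-running              # systemd state
    sudo dpkg --audit                        # DPKG consistency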
[15:11:45] 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why cobalt went down for 1 minute on Feburary 5 and then again 4 minutes later - https://phabricator.wikimedia.org/T157203#2999446 (10Paladox) >>! In T157203#2999445, @Dzahn wrote: >> [01:06:17] PROBLEM - configured eth on cobalt is... [15:32:06] PROBLEM - Check systemd state on graphite1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:32:16] PROBLEM - carbon-cache@d service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@d is failed [15:51:06] 06Operations, 10IDS-extension, 10Wikimedia-Extension-setup, 07I18n: Deploy IDS rendering engine to production - https://phabricator.wikimedia.org/T148693#2999457 (10Shoichi) a:05Shoichi>03awight Hello awight, about code review (IDS render server) , do you know who can do it? My translation team have... [15:54:06] RECOVERY - Check systemd state on graphite1003 is OK: OK - running: The system is fully operational [15:54:26] RECOVERY - carbon-cache@d service on graphite1003 is OK: OK - carbon-cache@d is active [16:04:36] PROBLEM - puppet last run on phab2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:33:36] RECOVERY - puppet last run on phab2001 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [17:33:36] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:02:36] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [18:05:16] PROBLEM - puppet last run on db1038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:34:16] RECOVERY - puppet last run on db1038 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [19:35:06] PROBLEM - puppet last run on analytics1055 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:38:26] PROBLEM - puppet last run on conf1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:03:06] RECOVERY - puppet last run on analytics1055 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [20:03:10] (03PS3) 10Paladox: Update npm to 2.x and nodejs to 4.x [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/303370 [20:06:26] RECOVERY - puppet last run on conf1002 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [20:52:26] PROBLEM - puppet last run on ms-be1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:05:26] PROBLEM - puppet last run on labvirt1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:13:16] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:20:26] RECOVERY - puppet last run on ms-be1009 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [21:26:46] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=80%) [21:27:41] 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why cobalt went down for 1 minute on Feburary 5 and then again 4 minutes later - https://phabricator.wikimedia.org/T157203#2999739 (10hashar) The huge User CPU spike around 21:00UTC is me doing maintenance on Zuul git repositories. Went on scand... [21:34:26] RECOVERY - puppet last run on labvirt1008 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [21:42:16] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [21:48:57] (03PS1) 10Tim Landscheidt: labstore: Remove create-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/336157 [21:56:06] PROBLEM - puppet last run on mw1260 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:21:26] PROBLEM - puppet last run on kafka1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:25:06] RECOVERY - puppet last run on mw1260 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [22:46:46] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 260 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [22:47:35] 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why cobalt went down for 1 minute on Feburary 5 and then again 4 minutes later - https://phabricator.wikimedia.org/T157203#2999835 (10Paladox) >>! In T157203#2999739, @hashar wrote: > The huge User CPU spike around 21:00UTC is me doing maintenan... [22:50:26] RECOVERY - puppet last run on kafka1013 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [22:51:46] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 12 probes of 260 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [22:55:16] PROBLEM - puppet last run on db1073 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:59:26] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed [22:59:46] PROBLEM - MD RAID on ms-be1012 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 2, Spare: 0 [22:59:47] ACKNOWLEDGEMENT - MD RAID on ms-be1012 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 2, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T157237 [22:59:50] 06Operations, 10ops-eqiad: Degraded RAID on ms-be1012 - https://phabricator.wikimedia.org/T157237#2999842 (10ops-monitoring-bot) [23:00:06] PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
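For the degraded MD RAID on ms-be1012 (T157237) a few lines above, the usual first pass is to identify which array and which member disks failed before asking for a swap. A short sketch on the host itself; the md and sd device names are placeholders to be read off /proc/mdstat (the xfs_admin error attached to the task below points at /dev/sdn):

    cat /proc/mdstat                          # which array is degraded, which members dropped
    sudo mdadm --detail /dev/md0              # detailed state of that array
    sudo smartctl -a /dev/sdn | head -n 40    # SMART health of the suspect disk
    dmesg | grep -iE 'sdn|md[0-9]' | tail     # recent kernel errors for the disk/array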
[23:09:26] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active [23:10:06] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational [23:23:16] RECOVERY - puppet last run on db1073 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [23:35:16] PROBLEM - puppet last run on db1038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:40:46] PROBLEM - puppet last run on db1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:41:16] !log truncating elasticsearch logs on elastic1022 - T139043 [23:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:22] T139043: nested RemoteTransportExceptions filled the disk on elastic1036 and elastic1045 during a rolling restart - https://phabricator.wikimedia.org/T139043 [23:42:06] RECOVERY - Disk space on elastic1022 is OK: DISK OK [23:42:30] !log truncating elasticsearch logs on elastic10(24|26|40) - T139043 [23:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:39] 06Operations, 10ops-eqiad: Degraded RAID on ms-be1012 - https://phabricator.wikimedia.org/T157237#2999842 (10Volans) Puppet failing too, I've ack'ed the alarm: ``` Error: xfs_admin -L swift-sdn3 /dev/sdn3 returned 1 instead of one of [0] Error: /Stage[main]/Role::Swift::Storage/Swift::Label_filesystem[/dev/sdn... [23:43:06] ACKNOWLEDGEMENT - puppet last run on ms-be1012 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[xfs_label-/dev/sdn3] Volans Broken disk: https://phabricator.wikimedia.org/T157237 [23:44:16] RECOVERY - Disk space on elastic1024 is OK: DISK OK [23:44:36] RECOVERY - Disk space on elastic1040 is OK: DISK OK [23:44:46] RECOVERY - Disk space on elastic1026 is OK: DISK OK [23:46:13] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680) (owner: 10Marostegui) [23:46:36] gehel: thanks!!! [23:47:19] :) [23:48:06] RECOVERY - puppet last run on elastic1040 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
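The elastic10xx disk-space recoveries above come from the logged log truncation (T139043): the root filesystems had filled with Elasticsearch exception logs, and truncating the open files frees the space immediately, whereas deleting them would not help until the Java process released its file handles. A rough sketch of that cleanup, assuming the stock Debian log location, which may differ on these hosts:

    df -h /                                                    # confirm the root fs is full
    sudo du -xsh /var/log/* 2>/dev/null | sort -h | tail       # find what is eating the space
    sudo sh -c 'truncate -s 0 /var/log/elasticsearch/*.log'    # reclaim space, keep open handles valid
    df -h /                                                    # verify the space came back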