[00:00:06] urandom try sshing now? [00:00:08] RECOVERY - Host xenon is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [00:00:27] RECOVERY - MegaRAID on xenon is OK: OK: no disks configured for RAID [00:00:37] YuviPanda: yeah, but it's the same thing as before [00:00:38] RECOVERY - dhclient process on xenon is OK: PROCS OK: 0 processes with command name dhclient [00:00:38] RECOVERY - salt-minion processes on xenon is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:00:46] acpi_pad is using a lot of cpu [00:00:58] RECOVERY - DPKG on xenon is OK: All packages OK [00:01:05] YuviPanda: do you think it's OK if i try unloading that kernel module? [00:01:07] RECOVERY - Disk space on xenon is OK: DISK OK [00:01:15] YuviPanda: https://www-947.ibm.com/support/entry/portal/docdisplay?lndocid=migr-5098951 [00:02:07] I'm going to ask forgiveness rather than permission on this one [00:02:58] cool; it worked [00:02:59] urandom yup [00:03:19] urandom and open a ticket? [00:03:27] YuviPanda: will do. [00:03:35] thanks [00:03:40] YuviPanda: thank you [00:10:18] RECOVERY - cassandra-a service on xenon is OK: OK - cassandra-a is active [00:11:17] RECOVERY - Restbase root url on xenon is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 1.297 second response time [00:13:18] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [00:14:40] urandom: just got back. i see it recovered. that bug talks about "if HT is disabled" so i checked if it is on xenon. enabled there though [00:14:54] yeah [00:15:56] mutante: it's wierd [00:15:58] 06Operations, 10Cassandra: xenon.eqiad.wmnet: very high cpu utilization - https://phabricator.wikimedia.org/T141675#2507314 (10Eevans) [00:16:02] 06Operations, 10Cassandra: xenon.eqiad.wmnet: very high cpu utilization - https://phabricator.wikimedia.org/T141675#2507327 (10Eevans) p:05Triage>03High [00:16:09] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [00:16:11] mutante, YuviPanda: ^^^ [00:16:39] should be good enough going into the weekend though [00:16:51] thanks urandom [00:17:01] yea, thanks, ack [00:17:03] it's not production, but i don't want it creating pager fatigue [00:17:21] :) es [00:17:23] yes [00:19:38] RECOVERY - cassandra-a CQL 10.64.0.202:9042 on xenon is OK: TCP OK - 0.004 second response time on port 9042 [00:23:53] 06Operations, 10Cassandra: xenon.eqiad.wmnet: very high cpu utilization - https://phabricator.wikimedia.org/T141675#2507314 (10Dzahn) yea, HT is enabled on xenon.. it seems to start here, when RT throttling gets activated 2680 Jul 29 22:23:40 xenon kernel: [10997327.180547] sched: RT throttling activated 268... 
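For context on the recovery above: acpi_pad is a kernel module, so the fix amounted to confirming it was the CPU hog and unloading it. A minimal sketch of that kind of check, assuming a host like xenon.eqiad.wmnet and standard Linux tooling (illustrative commands, not the exact ones urandom ran):

```bash
# Sketch only -- illustrative commands, not the exact ones used on xenon.

# 1. Confirm what is burning CPU (acpi_pad shows up as [acpi_pad/N] kernel threads).
ps -eo pid,pcpu,comm --sort=-pcpu | head -15

# 2. Check whether hyper-threading is enabled, since the IBM advisory linked above
#    only applies "if HT is disabled".
lscpu | grep -i 'thread(s) per core'

# 3. Unload the module and verify it is gone. Blacklisting it so it stays out after
#    a reboot would be a separate (puppetized) change.
modprobe -r acpi_pad      # equivalent to: rmmod acpi_pad
lsmod | grep acpi_pad || echo "acpi_pad no longer loaded"
```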
[00:32:08] PROBLEM - puppet last run on mw1177 is CRITICAL: CRITICAL: Puppet has 1 failures [00:56:37] 06Operations, 10Cassandra: xenon.eqiad.wmnet: very high cpu utilization - https://phabricator.wikimedia.org/T141675#2507368 (10Eevans) [00:57:47] RECOVERY - puppet last run on mw1177 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [02:20:18] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.12) (duration: 08m 10s) [02:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:25:55] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Jul 30 02:25:55 UTC 2016 (duration 5m 37s) [02:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:37:48] PROBLEM - puppet last run on dbstore2001 is CRITICAL: CRITICAL: puppet fail [02:47:38] RECOVERY - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is OK: TCP OK - 0.036 second response time on port 9042 [02:48:58] LALAALLLAALALLA (spam, repeated 25 times through 02:49:36) [02:49:38] CHAU [03:03:48] RECOVERY - puppet last run on dbstore2001 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [05:44:37] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 222, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR [06:29:57] PROBLEM - puppet last run on pc1006 is CRITICAL: CRITICAL: puppet fail [06:31:08] PROBLEM - puppet last run on mw2228 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:09] PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:18] PROBLEM - puppet last run on ms-be2022 is CRITICAL: CRITICAL: Puppet has 4 failures [06:31:18] PROBLEM - puppet last run on ms-be2026 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:28] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:08] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:58] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:58] PROBLEM - puppet last run on analytics1042 is CRITICAL: CRITICAL: Puppet has 1 failures [06:56:28] RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [06:56:29] RECOVERY - puppet last run on ms-be2022 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [06:56:29] RECOVERY - puppet last run on ms-be2026 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [06:56:37] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:48] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 224, down: 0, dormant: 0, excluded: 0,
unused: 0 [06:57:08] RECOVERY - puppet last run on pc1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:18] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:08] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:19] RECOVERY - puppet last run on mw2228 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:02:27] RECOVERY - puppet last run on analytics1042 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:15:27] PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.24 [07:17:09] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [08:52:49] RECOVERY - cassandra-c CQL 10.64.48.137:9042 on restbase1014 is OK: TCP OK - 0.006 second response time on port 9042 [09:05:49] 06Operations: reinstall snapshot1001.eqiad.wmnet with RAID, decomm snapshot1002,3,4 - https://phabricator.wikimedia.org/T140439#2507574 (10ArielGlenn) [09:10:47] 06Operations, 10Dumps-Generation, 07HHVM, 13Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#2507578 (10ArielGlenn) [09:14:53] 06Operations, 10Datasets-General-or-Unknown: reinstall snapshot1001.eqiad.wmnet with RAID, decomm snapshot1002,3,4 - https://phabricator.wikimedia.org/T140439#2507594 (10ArielGlenn) [12:49:39] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: Puppet has 1 failures [12:53:59] (03CR) 10Nemo bis: [C: 031] "Certainly ok as first step." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301893 (https://phabricator.wikimedia.org/T131340) (owner: 10Jforrester) [13:00:58] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [13:08:58] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [13:08:58] PROBLEM - HP RAID on ms-be1024 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [13:12:48] RECOVERY - HP RAID on ms-be1024 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [13:15:27] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:22:38] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [13:24:38] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [13:30:28] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [13:38:37] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [13:40:28] PROBLEM - HP RAID on ms-be1024 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [13:42:27] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
[13:44:27] RECOVERY - HP RAID on ms-be1024 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [13:44:27] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [13:46:18] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [13:50:17] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [13:54:09] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [13:55:30] PROBLEM - puppet last run on es2012 is CRITICAL: CRITICAL: puppet fail [13:56:07] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [14:02:09] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [14:04:17] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [14:13:59] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [14:19:48] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [14:19:57] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [14:21:08] RECOVERY - puppet last run on es2012 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [14:21:48] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [14:27:35] (03PS4) 10MarcoAurelio: Expanding throttle limits for enwiki Edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301761 (https://phabricator.wikimedia.org/T141421) [14:27:39] PROBLEM - HP RAID on ms-be1024 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [14:27:39] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [14:27:40] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [14:28:18] (03CR) 10MarcoAurelio: [C: 04-1] "Per task." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299354 (https://phabricator.wikimedia.org/T140550) (owner: 10Kharkiv07) [14:29:37] RECOVERY - HP RAID on ms-be1024 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [14:31:48] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [14:31:57] PROBLEM - HP RAID on ms-be1025 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
[14:33:57] RECOVERY - HP RAID on ms-be1025 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [14:35:47] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [14:37:17] PROBLEM - dhclient process on analytics1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:37:27] PROBLEM - configured eth on analytics1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:37:47] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [14:37:47] PROBLEM - Disk space on Hadoop worker on analytics1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:37:50] PROBLEM - Check size of conntrack table on analytics1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:37:50] PROBLEM - salt-minion processes on analytics1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:38:09] PROBLEM - puppet last run on analytics1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:38:09] PROBLEM - Hadoop DataNode on analytics1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:38:12] PROBLEM - Disk space on analytics1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:38:37] PROBLEM - YARN NodeManager Node-State on analytics1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:38:41] PROBLEM - DPKG on analytics1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:38:48] PROBLEM - MegaRAID on analytics1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:38:58] PROBLEM - Hadoop NodeManager on analytics1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:40:27] RECOVERY - YARN NodeManager Node-State on analytics1045 is OK: OK: YARN NodeManager analytics1045.eqiad.wmnet:8041 Node-State: RUNNING [14:40:31] RECOVERY - DPKG on analytics1045 is OK: All packages OK [14:40:48] RECOVERY - MegaRAID on analytics1045 is OK: OK: optimal, 13 logical, 14 physical [14:40:48] RECOVERY - Hadoop NodeManager on analytics1045 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [14:41:09] PROBLEM - puppet last run on mw2224 is CRITICAL: CRITICAL: puppet fail [14:41:17] RECOVERY - dhclient process on analytics1045 is OK: PROCS OK: 0 processes with command name dhclient [14:41:19] RECOVERY - configured eth on analytics1045 is OK: OK - interfaces up [14:41:49] RECOVERY - Disk space on Hadoop worker on analytics1045 is OK: DISK OK [14:41:51] RECOVERY - Check size of conntrack table on analytics1045 is OK: OK: nf_conntrack is 0 % full [14:41:51] RECOVERY - salt-minion processes on analytics1045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:41:58] RECOVERY - puppet last run on analytics1045 is OK: OK: Puppet is currently enabled, last run 13 minutes ago with 0 failures [14:41:58] RECOVERY - Hadoop DataNode on analytics1045 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [14:42:08] RECOVERY - Disk space on analytics1045 is OK: DISK OK [14:45:14] (03CR) 10MarcoAurelio: [C: 031] "Looks good to me. Needs to be rebased though." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/301807 (https://phabricator.wikimedia.org/T140566) (owner: 10Dereckson) [14:55:37] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [15:01:38] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [15:06:17] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [15:07:29] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [15:08:09] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4933024 keys - replication_delay is 0 [15:08:47] RECOVERY - puppet last run on mw2224 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:09:28] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [15:09:37] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [15:13:27] PROBLEM - HP RAID on ms-be1024 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [15:13:33] (03PS1) 10BBlack: openssl (1.0.2h-1~wmf3) jessie-wikimedia; urgency=medium [debs/openssl] - 10https://gerrit.wikimedia.org/r/301920 (https://phabricator.wikimedia.org/T131908) [15:15:19] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [15:17:18] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [15:23:08] RECOVERY - HP RAID on ms-be1024 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [15:25:07] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [15:31:17] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [15:35:07] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [15:35:08] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [15:41:08] PROBLEM - HP RAID on ms-be1024 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [15:42:58] RECOVERY - HP RAID on ms-be1024 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [15:42:58] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [15:44:57] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [15:47:56] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me!" 
[debs/openssl] - 10https://gerrit.wikimedia.org/r/301903 (owner: 10BBlack) [15:50:48] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [15:52:47] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [15:56:48] PROBLEM - HP RAID on ms-be1024 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [16:02:47] RECOVERY - HP RAID on ms-be1024 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [16:02:48] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [16:04:48] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [16:16:37] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [16:20:29] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [16:26:27] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [16:30:27] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [16:32:37] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [16:34:13] (03PS2) 10BBlack: openssl (1.0.2h-1~wmf3) jessie-wikimedia; urgency=medium [debs/openssl] - 10https://gerrit.wikimedia.org/r/301920 (https://phabricator.wikimedia.org/T131908) [16:34:28] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [16:44:18] PROBLEM - HP RAID on ms-be1025 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [16:46:17] RECOVERY - HP RAID on ms-be1025 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [16:46:18] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
[16:47:06] 06Operations, 10MediaWiki-Cache, 10Traffic: Possible increase in logged-out users being served cached outdated revisions - https://phabricator.wikimedia.org/T141693#2508537 (10Glaisher) [16:47:19] 06Operations, 10MediaWiki-Cache, 10Traffic: Possible increase in logged-out users being served cached outdated revisions - https://phabricator.wikimedia.org/T141693#2508549 (10Glaisher) p:05Triage>03High [16:49:27] 06Operations, 10MediaWiki-Cache, 10Traffic: Cached outdated revisions served to logged-out users - https://phabricator.wikimedia.org/T141687#2508550 (10Aklapper) [16:49:42] 06Operations, 10MediaWiki-Cache, 10Traffic: Cached outdated revisions served to logged-out users - https://phabricator.wikimedia.org/T141687#2507607 (10Aklapper) [16:49:45] 06Operations, 10MediaWiki-Cache, 10Traffic: Possible increase in logged-out users being served cached outdated revisions - https://phabricator.wikimedia.org/T141693#2508558 (10Aklapper) [16:52:50] 06Operations, 10MediaWiki-Cache, 10Traffic: Cached outdated revisions served to logged-out users - https://phabricator.wikimedia.org/T141687#2508560 (10Aklapper) Quoting Glaisher from T141693: > Multiple reports at enwiki and OTRS. > * https://en.wikipedia.org/wiki/Wikipedia:Help_desk#Why_No_Text_In_Article.... [16:53:22] 06Operations, 10MediaWiki-Cache, 10Traffic: Cached outdated revisions served to logged-out users - https://phabricator.wikimedia.org/T141687#2508563 (10Aklapper) (Wondering if belated syncing of ProofRead status on Wikisource reported in T141692 might be related to caching issues.) [16:54:49] 06Operations, 10MediaWiki-Cache, 10Traffic: Cached outdated revisions served to logged-out users - https://phabricator.wikimedia.org/T141687#2508566 (10Boshomi) [16:56:32] 06Operations, 10MediaWiki-Cache, 10Traffic: Cached outdated revisions served to logged-out users - https://phabricator.wikimedia.org/T141687#2508567 (10Glaisher) >>! In T141687#2508563, @Aklapper wrote: > (Wondering if belated syncing of ProofRead status on Wikisource reported in T141692 might be related to... [16:59:58] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [16:59:58] PROBLEM - HP RAID on ms-be1025 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [16:59:58] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [17:02:08] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [17:06:07] RECOVERY - HP RAID on ms-be1025 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [17:11:00] 06Operations, 10Phabricator, 06Project-Admins, 06Triagers: Requests for addition to the #acl*Project-Admins group (in comments) - https://phabricator.wikimedia.org/T706#2508576 (10Danny_B) [17:11:59] 06Operations, 10Phabricator, 06Project-Admins, 06Triagers: Requests for addition to the #acl*Project-Admins group (in comments) - https://phabricator.wikimedia.org/T706#1722432 (10Danny_B) [17:15:58] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
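T141687/T141693 above concern logged-out readers being served stale cached revisions. A quick way to see what the caches are actually handing out, and to force one page to be regenerated, looks roughly like the sketch below; the title Main_Page is only a placeholder, and the API purge is a generic MediaWiki mechanism rather than the specific remediation used for this task.

```bash
# Inspect the response a logged-out client would get: Age shows how old the cached
# object is, X-Cache shows which cache layers served it.
curl -sI 'https://en.wikipedia.org/wiki/Main_Page' \
  | grep -iE '^(age|x-cache|last-modified):'

# Ask MediaWiki to purge/re-render one page via the API (the purge module expects POST).
curl -s 'https://en.wikipedia.org/w/api.php' \
  --data 'action=purge&titles=Main_Page&format=json'
```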
[17:30:59] 06Operations, 06WMF-NDA-Requests: Please add me to #WMF-NDA - https://phabricator.wikimedia.org/T94238#2508587 (10Danny_B) [17:31:48] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [17:31:48] PROBLEM - HP RAID on ms-be1024 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [17:31:48] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [17:33:48] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [17:35:49] RECOVERY - HP RAID on ms-be1024 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [17:39:48] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [17:41:47] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [17:45:38] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [17:53:28] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [17:55:27] PROBLEM - HP RAID on ms-be1024 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [17:57:18] RECOVERY - HP RAID on ms-be1024 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [17:59:18] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [18:04:48] PROBLEM - puppet last run on mw1246 is CRITICAL: CRITICAL: Puppet has 1 failures [18:07:27] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [18:09:18] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [18:13:17] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [18:15:17] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [18:20:58] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [18:24:58] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [18:26:57] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [18:26:58] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
[18:28:57] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [18:32:18] RECOVERY - puppet last run on mw1246 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:34:59] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [18:36:12] (03Abandoned) 10Tpt: Deploy the Kartographer extension to meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298042 (https://phabricator.wikimedia.org/T139787) (owner: 10Tpt) [18:36:53] 06Operations, 10MediaWiki-Cache, 10Traffic: Cached outdated revisions served to logged-out users - https://phabricator.wikimedia.org/T141687#2508701 (10Boshomi) T141695 is also the same [18:38:57] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [18:38:58] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [18:40:06] 06Operations, 10MediaWiki-Cache, 10Traffic: Cached outdated revisions served to logged-out users - https://phabricator.wikimedia.org/T141687#2508709 (10Boshomi) [18:41:45] 06Operations, 10MediaWiki-Cache, 10Traffic: Cached outdated revisions served to logged-out users - https://phabricator.wikimedia.org/T141687#2507607 (10Boshomi) in T141695 @Gestrid wrote: >I work in the English Wikipedia's Teahouse (a place for new users to ask questions and get answers), where we have recen... [18:42:49] PROBLEM - HP RAID on ms-be1025 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [18:44:47] RECOVERY - HP RAID on ms-be1025 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [18:46:38] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [19:06:28] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [19:10:28] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [19:12:18] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [19:16:21] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [19:18:17] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [19:18:19] PROBLEM - HP RAID on ms-be1025 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [19:20:09] RECOVERY - HP RAID on ms-be1025 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [19:26:09] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
[19:27:58] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [19:28:06] 06Operations, 10MediaWiki-Cache, 10Traffic: Cached outdated revisions served to logged-out users - https://phabricator.wikimedia.org/T141687#2508719 (10Gestrid) Several reports have come in at the [[ https://en.wikipedia.org/wiki/Wikipedia:Teahouse/Questions | English Wikipedia's Teahouse ]] regarding this i... [19:32:17] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [19:34:25] 06Operations, 10MediaWiki-Cache, 10Traffic: Cached outdated revisions served to logged-out users - https://phabricator.wikimedia.org/T141687#2508720 (10Boshomi) [19:40:00] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [19:43:49] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [19:43:58] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [19:53:38] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [19:53:48] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [20:01:57] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [20:03:57] PROBLEM - HP RAID on ms-be1024 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [20:05:48] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [20:07:48] RECOVERY - HP RAID on ms-be1024 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [20:10:58] 06Operations, 10MediaWiki-Cache, 10Traffic: Logged out users are not seeing the most up-to-date versions of Wikipedia pages - https://phabricator.wikimedia.org/T141695#2508730 (10Danny_B) [20:13:39] PROBLEM - HP RAID on ms-be1025 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [20:15:28] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [20:15:37] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [20:15:38] RECOVERY - HP RAID on ms-be1025 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [20:17:29] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [20:23:27] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [20:25:19] PROBLEM - HP RAID on ms-be1024 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
[20:27:18] RECOVERY - HP RAID on ms-be1024 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [20:27:18] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [20:35:27] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [20:39:19] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [20:41:11] um, I'm just peeking in, does anyone know what is going on with these HP RAID warnings/ [20:41:17] ? [20:41:18] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [20:42:36] Not seeen anyone mention anything [20:44:06] it's ms-be1023,4,5,6 [20:45:17] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [20:45:38] seems to have been going on since about 4pm my time (EET) [20:45:43] so that's 1pm? UTC? [20:46:14] aye [20:48:43] all but one ms-be host (and there are 58 right now) is salt-responsive, I wonder why these 4 are seeming to have the issue [20:49:07] I think they're "new" [20:49:28] https://phabricator.wikimedia.org/T136631 [20:49:33] rack/setup/deploy ms-be102[2-7] [20:50:27] T140374 diagnose failed disks on ms-be1027 [20:50:28] T140374: diagnose failed disks on ms-be1027 - https://phabricator.wikimedia.org/T140374 [20:50:45] diagnose failed(?) sda on ms-be1022 [20:50:51] T140597 [20:50:51] T140597: diagnose failed(?) sda on ms-be1022 - https://phabricator.wikimedia.org/T140597 [20:51:07] (03PS1) 10Matanya: allow sysops on hewikt to remove autopatroller and patroller rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302054 [20:51:08] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [20:51:38] https://phabricator.wikimedia.org/T136631 [20:51:45] oh, you already pasted it in [20:51:54] yeh I had come to the same conclusion, these are "new"-ish [20:52:00] 06Operations, 10ops-eqiad, 10media-storage, 13Patch-For-Review: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631#2341913 (10Reedy) These all seem to be flapping with ``` [21:27:18] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 second... [20:52:26] The checklists look fairly out of date :) [20:53:05] 08:49 godog: swift eqiad-prod: ms-be102[3456] weight 3000 on the 28th [20:53:13] indeed they do [20:54:03] godog: any chance you're around? [20:54:12] so unlikely, saturday night. I shouldn't even be around :-P [20:56:14] ganglia looks pretty boring for those 4 [20:56:26] false positive and such [21:04:32] 06Operations, 10ops-eqiad, 10media-storage, 13Patch-For-Review: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631#2341913 (10ArielGlenn) Jul 30 18:18:38 ms-be1026 kernel: [1391836.163876] hpsa 0000:08:00.0: scsi 0:0:0:0 Aborting command ffff8812d751b4c0Tag:0x00000000:00000310 CD... 
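The triage above (are the new ms-be hosts reachable at all, and is the "HP RAID" check itself timing out?) can be reproduced with a couple of commands. A sketch assuming the standard salt and NRPE tooling; the NRPE command name check_hpssacli is an assumption, so substitute whatever the "HP RAID" service definition actually calls.

```bash
# From the salt master: which swift backends answer at all ("salt-responsive").
salt 'ms-be10*' test.ping

# From the Icinga host: run the flapping check by hand with the same 40 s timeout.
# NOTE: "check_hpssacli" is an assumed NRPE command name -- check the service definition.
/usr/lib/nagios/plugins/check_nrpe -H ms-be1026.eqiad.wmnet -c check_hpssacli -t 40

# On the host itself: the hpsa abort/reset messages pasted into T136631 come from the kernel log.
dmesg -T | grep -i hpsa | tail -20
```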
[21:04:32] sort of [21:04:38] I mean it's probably kernel [21:04:43] see ticket ^ [21:05:08] I should see if the hosts that aren't new have these issues [21:05:08] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [21:05:13] geethanks [21:05:45] I wonder if those were the hosts that paravoid among others were chasing various firmware updates [21:06:23] nope they don't [21:06:24] just these [21:06:28] maybe [21:06:46] anyways now it's on a ticket someplace [21:08:23] it appears there is no user impact [21:08:28] so yay for that [21:10:59] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [21:11:39] Is it worth ack-ing in incinga? [21:11:47] /schedule downtime [21:13:22] well it's not paging [21:14:02] I'd rather let it visibly flap, if someone knows better they can ack it [21:16:02] by 'knows better' I mean they have been working on it and know what attention the icinga whines need [21:16:03] aha, ok [21:16:15] I guess if it was paging, someone would've come in earlier to shut up or fix it [21:16:19] yep [21:16:33] seriously, I just was peeking in on irc and saw these, otherwise I would not know [21:18:59] heh [21:22:38] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [21:32:47] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [21:34:25] /49/6 [21:36:38] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [21:38:34] Reedy: feel like reviewing https://gerrit.wikimedia.org/r/#/c/302054/ ? [21:40:29] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [21:40:29] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [21:50:18] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [21:52:17] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [22:00:07] PROBLEM - HP RAID on ms-be1024 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [22:02:08] RECOVERY - HP RAID on ms-be1024 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [22:07:42] JEM [22:07:43] JEM [22:07:44] JEM [22:07:45] JEM [22:07:46] JEM [22:07:48] T [22:07:49] E [22:07:51] - [22:07:52] M [22:07:54] A [22:07:55] T [22:07:56] A [22:07:57] R [22:07:58] E [22:08:01] - [22:08:03] DHAHAHAHAHAH [22:08:08] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [22:08:09] JEM TE MATARE [22:08:11] HAHAHAHA [22:08:16] JEM HIJO DE PERRA [22:08:41] Spam [22:08:47] JEM [22:08:49] MALDITO [22:09:00] CALLATE PALADOX DE MIERDA [22:09:23] HIJO DE PERRA.. 
ME ACUSAS Y TEMATI [22:09:28] MATARE [22:09:42] @kb 185.140.114.121 [22:09:42] Permission denied [22:10:41] LARGATE DE AQUI PALADOX [22:10:43] PUDRETE [22:10:57] sigh [22:11:12] Italian [22:11:17] Largate DE AQUI PALADOX [22:11:22] What does that even mean [22:11:45] Portuguise now [22:12:38] mhh gone heh [22:12:42] it's Spanish [22:12:54] largate = go away [22:13:06] it's a known troll [22:13:59] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [22:14:18] Reedy apergos: sigh, looks like the controller isn't very happy (swift) thanks for looking ! [22:14:42] godog: is this something you have known about? [22:15:27] Oh [22:15:39] thank Platonides for explaning what it means :) [22:15:43] apergos: I've seen the check timing out before yeah [22:15:57] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [22:16:01] ok [22:19:18] Platonides, im guessing the person who did that did not know i only speak english and not spanish so i would not know what they were saying [22:19:20] lol [22:19:39] It also swore at me. [22:19:58] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [22:20:04] yes [22:20:09] also I'm silencing those, no point in spam [22:20:15] he has been attacking us on many channels for weeks [22:20:23] oh [22:20:34] It is very hard to ban them it seems [22:20:40] they keep changing ip and username [22:21:58] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [22:23:06] yes [22:23:17] oh [22:23:30] it said something to do with jem kill you [22:23:32] I didn't feel right about acking them until I knew that someone with a clue was on the case [22:23:34] that is translated from [22:23:44] JEM HIJO DE PERRA [22:23:50] thanks for patting icinga on the head [22:25:23] It is problay going to come back at sometime in the earley morning bst time. [22:25:28] like it did this morning [22:27:10] LOL it just changed its ip by one UAWIKI| (~Javiera@185.140.114.122) from this morning and wpayuda272626 (~cremosa@185.140.114.121) has joined now [22:27:57] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [22:34:31] apergos: yeah I'm taking a look, I agree it doesn't seem to have an impact tho [22:35:44] so far so good at least [22:45:58] 06Operations, 10MediaWiki-Cache, 10Traffic: Cached outdated revisions served to logged-out users - https://phabricator.wikimedia.org/T141687#2508947 (10BBlack) There haven't been any related changes recently on the cache side of things (e.g. changes in relevant VCL or how purging works, etc). More likely it... [22:51:57] 06Operations, 10ops-eqiad, 10media-storage, 13Patch-For-Review: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631#2508951 (10fgiunchedi) it looks like the controller isn't responding under load and makes the icinga alarms flap. Given that this is new hardware and we have the same... 
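godog's "silencing those" above corresponds to scheduling Icinga downtime for the flapping "HP RAID" services so they stop spamming the channel. One way to do that in bulk is through Icinga's external command interface; a sketch, where the command-file path is an assumption (check command_file in icinga.cfg) and the 24-hour window is arbitrary:

```bash
# Schedule 24h of downtime for the flapping "HP RAID" checks on the four new swift backends.
# /var/lib/icinga/rw/icinga.cmd is an assumed path; use the command_file from icinga.cfg.
now=$(date +%s)
end=$(( now + 24 * 3600 ))
for h in ms-be1023 ms-be1024 ms-be1025 ms-be1026; do
  printf '[%d] SCHEDULE_SVC_DOWNTIME;%s;HP RAID;%d;%d;1;0;%d;godog;controller flapping, T136631\n' \
    "$now" "$h" "$now" "$end" "$(( end - now ))" >> /var/lib/icinga/rw/icinga.cmd
done
```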
[22:52:07] 06Operations, 10MediaWiki-Cache, 10Traffic: Cached outdated revisions served to logged-out users - https://phabricator.wikimedia.org/T141687#2508952 (10Gestrid) The first question in the Teahouse related to this problem came in yesterday (July 29th) at about 2:16 pm EDT. This means that whatever happened wa... [23:26:03] (03PS3) 10BBlack: openssl (1.0.2h-1~wmf3) jessie-wikimedia; urgency=medium [debs/openssl] - 10https://gerrit.wikimedia.org/r/301920 (https://phabricator.wikimedia.org/T131908) [23:33:18] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [23:35:09] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4913580 keys - replication_delay is 0 [23:36:45] !log deleted 7 files from server for legal compliance [23:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:43:19] Jamesofur: vague much [23:43:38] Reedy: not really, there is literally only 1 reason I ever do that :) [23:44:02] even saying which server it is might be more useful ;) [23:45:14] Reedy: I don't know which server it is! The magic eraseArchivedFile.php goes and finds the files I needs and purges the caches etc ;) and if I say what wiki people could go looking for my logs which is #notgood in this case ;) [23:46:19] oh, so you deleted media? [23:46:22] * Jamesofur nods [23:47:30] think there's a better way to word it for the future? I can adjust the internal docs we have for the procedure. I mostly log for the unlikely possibility that something fucks up we have a time stamp (also because for this particular procedure I REALLY like having time stamps) [23:48:22] I think calling them uploads... might make it clearer [23:48:34] we have a clearer definition of what they actually are [23:48:41] vs "files" which could be anything [23:49:02] or file uploads.. [23:49:14] ahhh, that's a fair point [23:49:19] hadn't even thought about that angle [23:50:06] (because you're right if I had deleted a log or something for legal compliance, which I've done before, I totally would have said what server etc) [23:53:11] Not a big deal, just makes it clearer rather than "Jamesofur deleted a random file on a random server" ;) [23:53:17] ;) yup [23:53:25] Another way.. would say deleting file from swift [23:53:31] Again, that's specific and fairly obvious [23:53:37] * Jamesofur nods [23:54:08] but only to those who know what they're talking about which isn't a bad way to do it [23:54:17] Yeah, indeed [23:54:33] People who know more than the average user know what you're doing [23:54:37] Rather than wtf-ing at it [23:55:33] * Jamesofur nods [23:55:47] yup, wrote it down in our docs :) appreciate it [23:56:06] cool, no :) [23:56:08] *np [23:56:31] "Used eraseArchivedFile.php to remove 7 files for legal compliance" [23:56:53] having a timestamp for destructive operations is probably a good thing [23:57:31] Yeah, or that [23:57:41] e.g. when user rights change logging broke we could use the times to help the DBA find all the rights changes that weren't accounted for [23:57:47] "Why did Jamesofur just set off the chaos monkey?" [23:58:47] although in that case we had the data necessary to reconstruct almost everything (except log summaries IIRC) [23:58:49] I guess this script actually properly deletes things completely, unlike on-wiki deletion [23:58:57] * Jamesofur nods [23:58:58] yup [23:59:18] Reedy: why WOULDN'T I set off the chaos monkey? [23:59:20] ;)
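For reference, the "deleted 7 files from server for legal compliance" entry above refers to MediaWiki's eraseArchivedFile.php maintenance script, which permanently erases a deleted upload from the file backend rather than leaving it recoverable in the archive zone. A sketch of a single invocation; the wiki, file name and exact option spelling are illustrative, so verify against the script's --help before relying on them.

```bash
# Erase every archived version of one deleted upload, permanently, on one wiki.
# Wiki and file name are placeholders; --filename/--filekey/--delete are the options
# this script is generally documented to take (check --help before use).
mwscript eraseArchivedFile.php --wiki=commonswiki \
  --filename 'Example_file.jpg' --filekey '*' --delete

# And, per the discussion above, log it with enough detail to be useful later, e.g.:
#   !log Used eraseArchivedFile.php to remove 7 file uploads for legal compliance
```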