[01:41:18] RECOVERY - Memory correctable errors -EDAC- on scb1002 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=scb1002&var-datasource=eqiad%2520prometheus%252Fops
[03:33:48] PROBLEM - puppet last run on mw2234 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-ISP.mmdb.gz]
[04:04:27] RECOVERY - puppet last run on mw2234 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[04:10:53] (PS1) TerraCodes: Finish $wmfRealm to $wmgRealm [mediawiki-config] - https://gerrit.wikimedia.org/r/444425 (https://phabricator.wikimedia.org/T45956)
[04:27:08] PROBLEM - Debian mirror in sync with upstream on sodium is CRITICAL: /srv/mirrors/debian is over 14 hours old.
[04:47:58] RECOVERY - Maps tiles generation on einsteinium is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[06:27:27] PROBLEM - puppet last run on scb2002 is CRITICAL: CRITICAL: Puppet has 18 failures. Last run 3 minutes ago with 18 failures. Failed resources (up to 3 shown): Package[cpjobqueue/deploy],Exec[chown /srv/deployment/cpjobqueue for deploy-service],Package[recommendation-api/deploy],Exec[chown /srv/deployment/recommendation-api for deploy-service]
[06:57:48] RECOVERY - puppet last run on scb2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:50:37] I just got a MediaWiki internal error.
[07:50:42] Original exception: [W0HB1QpAICAAABpvRNYAAACC] 2018-07-08 07:49:37: Fatal exception of type "Wikimedia\Rdbms\DBQueryTimeoutError"
[07:50:54] How do I set $wgShowExceptionDetails = true; and $wgShowDBErrorBacktrace = true; at the bottom of LocalSettings.php to show detailed debugging information?
[07:52:04] akosiaris ^^
[07:52:12] Should I report that, or just ...move on?
[07:57:50] depends what you were doing
[07:58:08] you can always file a task on phab and tag it with wikimedia-log-errors and someone can have a look at it
[07:59:28] On a ship with terrible wifi...could barely get the javascript to work to add the tag, but I did it https://phabricator.wikimedia.org/T199044
[08:00:13] Not sure what more tags to add though...hopefully our wonderful Aklapper might go around and tag it
[08:38:25] Operations, Analytics, Analytics-Kanban, netops, Patch-For-Review: Review analytics-in4 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (elukey) >>! In T198623#4404041, @ayounsi wrote: > ``` > show firewall family inet filter analytics-in4 term archiva > from { > de...
[10:45:48] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 47.40, 36.78, 23.71
[10:46:17] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 50.27, 33.10, 20.08
[11:01:17] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 9.15, 18.27, 23.99
[11:04:57] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 8.28, 15.27, 23.02
[11:27:36] !log restart rsyslog on lithium - in:imtcp thread stuck at 99% cpu usage
[11:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:07] RECOVERY - Debian mirror in sync with upstream on sodium is OK: /srv/mirrors/debian is over 0 hours old.
[12:59:22] (CR) Mobrovac: "> Does it also need a cleanup in role::restbase::alerts ?" [puppet] - https://gerrit.wikimedia.org/r/444247 (https://phabricator.wikimedia.org/T186567) (owner: Mobrovac)
[13:10:52] (PS1) Mobrovac: RESTBase: Remove obsolete $wgRestbaseServer [mediawiki-config] - https://gerrit.wikimedia.org/r/444440
[13:35:31] Operations, ops-eqiad, DBA: db1069 bad disk - https://phabricator.wikimedia.org/T199056 (Marostegui)
[13:35:44] Operations, ops-eqiad, DBA: db1069 bad disk - https://phabricator.wikimedia.org/T199056 (Marostegui) p:Triage>Normal
[13:37:47] ACKNOWLEDGEMENT - Device not healthy -SMART- on db1069 is CRITICAL: cluster=mysql device=megaraid,0 instance=db1069:9100 job=node site=eqiad Marostegui T199056 - The acknowledgement expires at: 2018-07-11 13:37:28. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1069&var-datasource=eqiad%2520prometheus%252Fops
[14:38:56] (CR) Vgutierrez: [C: 1] Blacklist floppy driver [puppet] - https://gerrit.wikimedia.org/r/444238 (owner: Muehlenhoff)
[16:16:12] (PS1) MarcoAurelio: throttle: lift limits for Kaqchikel edit-a-thon; clear expired rules [mediawiki-config] - https://gerrit.wikimedia.org/r/444443 (https://phabricator.wikimedia.org/T199040)
[16:17:46] (CR) jerkins-bot: [V: -1] throttle: lift limits for Kaqchikel edit-a-thon; clear expired rules [mediawiki-config] - https://gerrit.wikimedia.org/r/444443 (https://phabricator.wikimedia.org/T199040) (owner: MarcoAurelio)
[16:30:11] Operations, SRE-Access-Requests: WMF-NDA-Request for User:Braveheart - https://phabricator.wikimedia.org/T198190 (Braveheart) Thanks Akosiaris, currently talking to analytics how to continue.
[16:30:14] (PS2) MarcoAurelio: throttle: lift limits for Kaqchikel edit-a-thon; clear expired rules [mediawiki-config] - https://gerrit.wikimedia.org/r/444443 (https://phabricator.wikimedia.org/T199040)
[16:50:24] Operations, Beta-Cluster-Infrastructure, Security-Team, Patch-For-Review: Delete deployment-mediawiki06 - https://phabricator.wikimedia.org/T192996 (Krenair) @elukey you created this instance, know what it's for?
[16:59:45] Operations, Beta-Cluster-Infrastructure, Security-Team, Patch-For-Review: Delete deployment-mediawiki06 - https://phabricator.wikimedia.org/T192996 (elukey) It was probably a regular mw appserver for beta, don't have specific memories about it.
[17:09:45] (CR) Ppchelko: [C: 1] RESTBase: Remove obsolete $wgRestbaseServer [mediawiki-config] - https://gerrit.wikimedia.org/r/444440 (owner: Mobrovac)
[17:20:37] PROBLEM - Host analytics1060 is DOWN: PING CRITICAL - Packet loss = 100%
[17:20:57] RECOVERY - Host analytics1060 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[17:26:08] PROBLEM - puppet last run on lvs4007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:51:38] RECOVERY - puppet last run on lvs4007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:32:06] Puppet, Beta-Cluster-Infrastructure, Patch-For-Review: Puppet broken on deployment-mx due to systemd on trusty - https://phabricator.wikimedia.org/T184244 (Krenair) So now I think we just need to get https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/436431/ merged ?
[18:39:15] Puppet, Cloud-Services: Migrate references from $instance.eqiad.wmflabs to $instance.$project.eqiad.wmflabs - https://phabricator.wikimedia.org/T153608 (Krenair)
[18:40:47] (PS6) Alex Monk: deployment-prep logstash: replace deployment-tin reference [puppet] - https://gerrit.wikimedia.org/r/438001 (https://phabricator.wikimedia.org/T192071) (owner: Dzahn)
[18:40:56] (CR) Alex Monk: [C: 1] deployment-prep logstash: replace deployment-tin reference [puppet] - https://gerrit.wikimedia.org/r/438001 (https://phabricator.wikimedia.org/T192071) (owner: Dzahn)
[18:41:06] Puppet, Cloud-Services: Migrate references from $instance.eqiad.wmflabs to $instance.$project.eqiad.wmflabs - https://phabricator.wikimedia.org/T153608 (Krenair) The deployment-prep ones will be fixed with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/436431/ and https://gerrit.wikimedia.org/r/#...
[18:44:46] Operations, Beta-Cluster-Infrastructure, Security-Team, Patch-For-Review: Delete deployment-mediawiki06 - https://phabricator.wikimedia.org/T192996 (Krenair) Alright let's merge the above and delete this instance?
[18:53:38] PROBLEM - High load average on labstore1007 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [24.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[19:10:11] !log disable puppet on labstore1007 to turn down ratelimits for nginx (load is rising and rising)
[19:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:13:14] (PS1) Rush: labstore: reduce nfs dumps read allowed while in failure mode [puppet] - https://gerrit.wikimedia.org/r/444463
[19:16:18] PROBLEM - very high load average likely xfs on ms-be1017 is CRITICAL: CRITICAL - load average: 179.11, 109.44, 60.90
[19:17:57] PROBLEM - MD RAID on ms-be1017 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0
[19:17:58] ACKNOWLEDGEMENT - MD RAID on ms-be1017 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T199063
[19:18:14] Operations, ops-eqiad: Degraded RAID on ms-be1017 - https://phabricator.wikimedia.org/T199063 (ops-monitoring-bot)
[19:20:27] PROBLEM - Disk space on ms-be1017 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdl1 is not accessible: Input/output error
[19:20:38] PROBLEM - Check systemd state on ms-be1017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[19:21:07] PROBLEM - swift-container-updater on ms-be1017 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[19:21:57] RECOVERY - very high load average likely xfs on ms-be1017 is OK: OK - load average: 15.49, 70.57, 62.15
[19:31:58] PROBLEM - Device not healthy -SMART- on ms-be1017 is CRITICAL: cluster=swift device={cciss,10,cciss,11,cciss,12,cciss,13,cciss,3,cciss,4,cciss,5,cciss,6,cciss,7,cciss,8,cciss,9} instance=ms-be1017:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1017&var-datasource=eqiad%2520prometheus%252Fops
[19:42:09] RECOVERY - High load average on labstore1007 is OK: OK: Less than 50.00% above the threshold [16.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[19:43:30] (PS2) Rush: labstore: reduce nfs dumps read allowed while in failure mode [puppet] - https://gerrit.wikimedia.org/r/444463
[19:45:17] PROBLEM - HP RAID on ms-be1035 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:50:25] Operations, Traffic, HTTPS, Patch-For-Review: letsencrypt puppetization: upgrade for scalability - https://phabricator.wikimedia.org/T134447 (Krenair) > build an abstraction around this for large subject counts across multiple auto-split certs (for secure direct case, and probably also beta clust...
[19:50:44] (PS3) Rush: labstore: reduce nfs dumps read allowed while in failure mode [puppet] - https://gerrit.wikimedia.org/r/444463
[19:51:31] (CR) Rush: [C: 2] labstore: reduce nfs dumps read allowed while in failure mode [puppet] - https://gerrit.wikimedia.org/r/444463 (owner: Rush)
[20:05:56] (Abandoned) Awight: Install LFS on scap targets [puppet] - https://gerrit.wikimedia.org/r/437719 (https://phabricator.wikimedia.org/T180627) (owner: Awight)
[20:06:21] (CR) Awight: [C: -1] "I think we can abandon, we have Icde5d4e6d9c6b7204" [puppet] - https://gerrit.wikimedia.org/r/432432 (owner: Halfak)
[20:14:34] (Abandoned) Awight: Enable ORES on simplewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/395059 (https://phabricator.wikimedia.org/T181848) (owner: Awight)
[20:16:06] (Abandoned) Awight: [DNM] Enable Extension:JADE on all beta cluster wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/414771 (https://phabricator.wikimedia.org/T176333) (owner: Awight)
[20:16:15] (Abandoned) Awight: Add eswikibooks and svwiki to the beta cluster [puppet] - https://gerrit.wikimedia.org/r/414873 (https://phabricator.wikimedia.org/T174560) (owner: Awight)
[20:19:45] (PS1) Krinkle: RESTBase: Remove obsolete $wgRestbaseServer (2) [mediawiki-config] - https://gerrit.wikimedia.org/r/444466
[20:20:04] (PS2) Krinkle: RESTBase: Remove obsolete $wgRestbaseServer (2) [mediawiki-config] - https://gerrit.wikimedia.org/r/444466
[20:20:29] (PS2) Krinkle: RESTBase: Remove obsolete $wgRestbaseServer [mediawiki-config] - https://gerrit.wikimedia.org/r/444440 (owner: Mobrovac)
[20:20:47] (CR) Krinkle: "Moved removal to second patch in order to be safe for scap and stagable with scap pull." [mediawiki-config] - https://gerrit.wikimedia.org/r/444440 (owner: Mobrovac)
[20:20:56] (PS3) Krinkle: RESTBase: Remove obsolete $wgRestbaseServer (2) [mediawiki-config] - https://gerrit.wikimedia.org/r/444466
[20:21:11] (PS1) Bodhisattwa: Enable transwiki import to ru.wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/444467 (https://phabricator.wikimedia.org/T199045)
[20:21:39] (PS4) Krinkle: RESTBase: Remove obsolete $wgRestbaseServer (2) [mediawiki-config] - https://gerrit.wikimedia.org/r/444466
[20:22:48] * Krinkle staging on deploy1001/mwdebug1002
[20:22:51] (CR) Krinkle: [C: 2] RESTBase: Remove obsolete $wgRestbaseServer [mediawiki-config] - https://gerrit.wikimedia.org/r/444440 (owner: Mobrovac)
[20:24:33] (Merged) jenkins-bot: RESTBase: Remove obsolete $wgRestbaseServer [mediawiki-config] - https://gerrit.wikimedia.org/r/444440 (owner: Mobrovac)
[20:26:47] RECOVERY - HP RAID on ms-be1035 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK
[20:27:56] (CR) jenkins-bot: RESTBase: Remove obsolete $wgRestbaseServer [mediawiki-config] - https://gerrit.wikimedia.org/r/444440 (owner: Mobrovac)
[20:28:25] (CR) Krinkle: [C: 2] RESTBase: Remove obsolete $wgRestbaseServer (2) [mediawiki-config] - https://gerrit.wikimedia.org/r/444466 (owner: Krinkle)
[20:29:57] (Merged) jenkins-bot: RESTBase: Remove obsolete $wgRestbaseServer (2) [mediawiki-config] - https://gerrit.wikimedia.org/r/444466 (owner: Krinkle)
[20:30:06] !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: Id94ca5f55c2c6 (duration: 00m 53s)
[20:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:21] (CR) jenkins-bot: RESTBase: Remove obsolete $wgRestbaseServer (2) [mediawiki-config] - https://gerrit.wikimedia.org/r/444466 (owner: Krinkle)
[20:37:06] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: I81d566ba860 (duration: 00m 51s)
[20:37:07] (PS1) Andrew Bogott: Designate pool config files: mask out a password [puppet] - https://gerrit.wikimedia.org/r/444468
[20:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:57] (CR) Andrew Bogott: [C: 2] Designate pool config files: mask out a password [puppet] - https://gerrit.wikimedia.org/r/444468 (owner: Andrew Bogott)
[20:40:17] PROBLEM - Check for gridmaster host resolution UDP on labservices1002 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[20:40:22] (PS2) Bodhisattwa: Enable transwiki import to ru.wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/444467 (https://phabricator.wikimedia.org/T199045)
[20:40:28] PROBLEM - Auth DNS on labservices1002 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[20:41:02] (PS3) Bodhisattwa: Enable transwiki import to ru.wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/444467 (https://phabricator.wikimedia.org/T199045)
[20:46:58] RECOVERY - Auth DNS on labservices1002 is OK: DNS OK: 0.011 seconds response time. labs-ns1.wikimedia.org returns
[20:47:48] RECOVERY - Check for gridmaster host resolution UDP on labservices1002 is OK: DNS OK - 0.009 seconds response time (tools-grid-master.tools.eqiad.wmflabs. 60 IN A 10.68.20.158)
[20:48:19] (CR) Alex Monk: Designate pool config files: mask out a password (1 comment) [puppet] - https://gerrit.wikimedia.org/r/444468 (owner: Andrew Bogott)
[20:50:30] (PS1) Andrew Bogott: designate: mask out one more password [puppet] - https://gerrit.wikimedia.org/r/444469 (https://phabricator.wikimedia.org/T199065)
[20:51:21] (CR) Andrew Bogott: [C: 2] designate: mask out one more password [puppet] - https://gerrit.wikimedia.org/r/444469 (https://phabricator.wikimedia.org/T199065) (owner: Andrew Bogott)
[21:03:13] (CR) MarcoAurelio: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/444467 (https://phabricator.wikimedia.org/T199045) (owner: Bodhisattwa)
[21:07:24] Krinkle, what's your opinion on 440092 / T197095 ?
[21:07:25] T197095: Clean duplicate right declaration - https://phabricator.wikimedia.org/T197095
[21:07:41] (you CR-1ed it, that's why I ask)
[21:12:11] Urbanecm: The task does not explain to me what problem exists and why and how it proposes to solve it.
[21:12:53] I think it says it, but I do not understand it.
[21:13:10] Maybe that just means someone else should review it. But my -1 was not about the task, it was about the commit message.
[21:13:41] If I understand correctly, what you mean is that all bureaucrats on these wikis are also sysop, and sysop already grants this right, so it can be removed.
[21:13:48] If that is the case, I recommend adding it to the commit message.
[21:14:10] (PS3) Urbanecm: Do not assign sysop level privileges to bureaucrats explicitely [mediawiki-config] - https://gerrit.wikimedia.org/r/440092 (https://phabricator.wikimedia.org/T197095)
[21:14:21] (CR) jerkins-bot: [V: -1] Do not assign sysop level privileges to bureaucrats explicitely [mediawiki-config] - https://gerrit.wikimedia.org/r/440092 (https://phabricator.wikimedia.org/T197095) (owner: Urbanecm)
[21:14:42] (PS4) Urbanecm: Don't assign sysop-level privileges to bureaucrats explicitly [mediawiki-config] - https://gerrit.wikimedia.org/r/440092 (https://phabricator.wikimedia.org/T197095)
[21:14:54] (CR) jerkins-bot: [V: -1] Don't assign sysop-level privileges to bureaucrats explicitly [mediawiki-config] - https://gerrit.wikimedia.org/r/440092 (https://phabricator.wikimedia.org/T197095) (owner: Urbanecm)
[21:14:58] Exactly.
[21:15:04] Is the commit message better now, Krinkle?
[21:17:21] (PS5) Urbanecm: Don't assign sysop-level privileges to bureaucrats explicitly [mediawiki-config] - https://gerrit.wikimedia.org/r/440092 (https://phabricator.wikimedia.org/T197095)
[21:18:02] If so, please remove the negative review (it didn't disappear after changing just a commit message). Thanks!
[21:38:04] (PS4) Urbanecm: Add importsources to ru.wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/444467 (https://phabricator.wikimedia.org/T199045) (owner: Bodhisattwa)
[21:38:12] (CR) Urbanecm: [C: 1] Add importsources to ru.wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/444467 (https://phabricator.wikimedia.org/T199045) (owner: Bodhisattwa)
[22:03:19] (CR) Krinkle: [C: 1] Stop loading the MwEmbedSupport extension, part I [mediawiki-config] - https://gerrit.wikimedia.org/r/441519 (owner: Jforrester)
[23:09:27] PROBLEM - HP RAID on ms-be1028 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.