[00:14:05] Operations, Analytics, Analytics-Cluster: furud - DISK CRITICAL - /mnt/hdfs is not accessible: Input/output error - https://phabricator.wikimedia.org/T221483 (Nuria) Many thanks @Dzahn
[00:43:17] Operations, Analytics, Analytics-Cluster: furud - DISK CRITICAL - /mnt/hdfs is not accessible: Input/output error - https://phabricator.wikimedia.org/T221483 (Peachey88)
[01:22:04] (PS1) Alex Monk: network::constants: Move puppet_frontends to using existing data and resources [puppet] - https://gerrit.wikimedia.org/r/505356 (https://phabricator.wikimedia.org/T220894)
[01:22:53] (CR) jerkins-bot: [V: -1] network::constants: Move puppet_frontends to using existing data and resources [puppet] - https://gerrit.wikimedia.org/r/505356 (https://phabricator.wikimedia.org/T220894) (owner: Alex Monk)
[01:23:03] Operations, Patch-For-Review: Replacement of network::constant's special_hosts - https://phabricator.wikimedia.org/T220894 (Krenair)
[01:27:48] (PS2) Alex Monk: network::constants: Move puppet_frontends to using existing data and resources [puppet] - https://gerrit.wikimedia.org/r/505356 (https://phabricator.wikimedia.org/T220894)
[01:28:34] (CR) jerkins-bot: [V: -1] network::constants: Move puppet_frontends to using existing data and resources [puppet] - https://gerrit.wikimedia.org/r/505356 (https://phabricator.wikimedia.org/T220894) (owner: Alex Monk)
[01:31:52] (PS3) Alex Monk: network::constants: Move puppet_frontends to using existing data and resources [puppet] - https://gerrit.wikimedia.org/r/505356 (https://phabricator.wikimedia.org/T220894)
[01:32:38] (CR) jerkins-bot: [V: -1] network::constants: Move puppet_frontends to using existing data and resources [puppet] - https://gerrit.wikimedia.org/r/505356 (https://phabricator.wikimedia.org/T220894) (owner: Alex Monk)
[01:38:58] (PS4) Alex Monk: network::constants: Move puppet_frontends to using existing data and resources [puppet] - https://gerrit.wikimedia.org/r/505356 (https://phabricator.wikimedia.org/T220894)
[01:39:45] (CR) jerkins-bot: [V: -1] network::constants: Move puppet_frontends to using existing data and resources [puppet] - https://gerrit.wikimedia.org/r/505356 (https://phabricator.wikimedia.org/T220894) (owner: Alex Monk)
[01:44:41] PROBLEM - puppet last run on mw1286 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:46:44] (PS5) Alex Monk: network::constants: Move puppet_frontends to using existing data and resources [puppet] - https://gerrit.wikimedia.org/r/505356 (https://phabricator.wikimedia.org/T220894)
[01:47:32] (CR) jerkins-bot: [V: -1] network::constants: Move puppet_frontends to using existing data and resources [puppet] - https://gerrit.wikimedia.org/r/505356 (https://phabricator.wikimedia.org/T220894) (owner: Alex Monk)
[01:49:03] (PS6) Alex Monk: network::constants: Move puppet_frontends to using existing data and resources [puppet] - https://gerrit.wikimedia.org/r/505356 (https://phabricator.wikimedia.org/T220894)
[01:49:48] (CR) jerkins-bot: [V: -1] network::constants: Move puppet_frontends to using existing data and resources [puppet] - https://gerrit.wikimedia.org/r/505356 (https://phabricator.wikimedia.org/T220894) (owner: Alex Monk)
[01:52:53] (PS7) Alex Monk: network::constants: Move puppet_frontends to using existing data instead [puppet] - https://gerrit.wikimedia.org/r/505356 (https://phabricator.wikimedia.org/T220894)
[01:53:31] PROBLEM - puppet last run on puppetmaster1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:57:58] (PS8) Alex Monk: network::constants: Move puppet_frontends to using existing data instead [puppet] - https://gerrit.wikimedia.org/r/505356 (https://phabricator.wikimedia.org/T220894)
[01:58:18] (CR) Alex Monk: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/505356 (https://phabricator.wikimedia.org/T220894) (owner: Alex Monk)
[01:58:44] (CR) jerkins-bot: [V: -1] network::constants: Move puppet_frontends to using existing data instead [puppet] - https://gerrit.wikimedia.org/r/505356 (https://phabricator.wikimedia.org/T220894) (owner: Alex Monk)
[01:59:59] (CR) jerkins-bot: [V: -1] network::constants: Move puppet_frontends to using existing data instead [puppet] - https://gerrit.wikimedia.org/r/505356 (https://phabricator.wikimedia.org/T220894) (owner: Alex Monk)
[02:15:19] (CR) Alex Monk: "Observations from https://puppet-compiler.wmflabs.org/compiler1002/156/:" [puppet] - https://gerrit.wikimedia.org/r/505356 (https://phabricator.wikimedia.org/T220894) (owner: Alex Monk)
[02:16:29] RECOVERY - puppet last run on mw1286 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[02:18:35] Operations, fundraising-tech-ops, netops: Network setup for frmon2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T221475 (cwdent) @ayounsi - the new policies are at 1555726449, let me know if you need anything else thanks
[02:25:19] RECOVERY - puppet last run on puppetmaster1002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[02:44:15] (PS1) Alex Monk: network::constants: Remove seemingly unused druid_analytics_hosts [puppet] - https://gerrit.wikimedia.org/r/505366 (https://phabricator.wikimedia.org/T220894)
[02:44:39] (CR) Alex Monk: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/505366 (https://phabricator.wikimedia.org/T220894) (owner: Alex Monk)
[02:45:07] (CR) jerkins-bot: [V: -1] network::constants: Remove seemingly unused druid_analytics_hosts [puppet] - https://gerrit.wikimedia.org/r/505366 (https://phabricator.wikimedia.org/T220894) (owner: Alex Monk)
[02:45:12] (CR) jerkins-bot: [V: -1] network::constants: Remove seemingly unused druid_analytics_hosts [puppet] - https://gerrit.wikimedia.org/r/505366 (https://phabricator.wikimedia.org/T220894) (owner: Alex Monk)
[02:46:40] (PS2) Alex Monk: network::constants: Remove seemingly unused druid_analytics_hosts [puppet] - https://gerrit.wikimedia.org/r/505366 (https://phabricator.wikimedia.org/T220894)
[02:48:14] (CR) Alex Monk: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/505366 (https://phabricator.wikimedia.org/T220894) (owner: Alex Monk)
[03:27:51] PROBLEM - puppet last run on mw1340 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:54:23] RECOVERY - puppet last run on mw1340 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[03:57:05] PROBLEM - puppet last run on doc1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:23:04] (PS1) Alex Monk: network::constants: Move various analytics special_hosts to hiera [puppet] - https://gerrit.wikimedia.org/r/505373 (https://phabricator.wikimedia.org/T220894)
[04:23:37] RECOVERY - puppet last run on doc1001 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[04:23:51] Operations, Patch-For-Review: Replacement of network::constant's special_hosts - https://phabricator.wikimedia.org/T220894 (Krenair)
[04:34:05] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:44:29] PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:50:07] PROBLEM - puppet last run on mw1343 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:55:37] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 111 probes of 410 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[04:56:11] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 152 probes of 449 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[05:05:53] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[05:06:15] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 15 probes of 410 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[05:06:47] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 4 probes of 449 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[05:11:03] RECOVERY - puppet last run on cp3042 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[05:21:55] RECOVERY - puppet last run on mw1343 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:22:29] PROBLEM - puppet last run on wtp1037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:24:59] Operations, ops-codfw, DBA: Degraded RAID on db2047 - https://phabricator.wikimedia.org/T221481 (Marostegui) a:Papaul Let's replace the failed disk first only, disk #12
[06:29:35] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/tmpreaper.conf]
[06:29:53] Operations, ops-codfw, DC-Ops, decommission: decommission db2014,db2020, db2021, db2022, db2024, db2031 - https://phabricator.wikimedia.org/T221424 (Marostegui) All those hosts were decommissioned as part of T176243, so probably a leftover from that. Removing our tag as there is nothing for us to...
[06:32:23] Operations, ops-codfw, DC-Ops, decommission: decommission db2014,db2020, db2021, db2022, db2024, db2031 - https://phabricator.wikimedia.org/T221424 (Marostegui)
[06:54:19] RECOVERY - puppet last run on wtp1037 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:56:01] RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:04:13] PROBLEM - puppet last run on cloudvirtan1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:30:43] RECOVERY - puppet last run on cloudvirtan1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:41:19] (PS5) Giuseppe Lavagetto: profile::mediawiki::php: tweak ini settings [puppet] - https://gerrit.wikimedia.org/r/502986 (https://phabricator.wikimedia.org/T211488)
[07:42:25] (CR) Giuseppe Lavagetto: [C: +2] profile::mediawiki::php: tweak ini settings [puppet] - https://gerrit.wikimedia.org/r/502986 (https://phabricator.wikimedia.org/T211488) (owner: Giuseppe Lavagetto)
[07:42:38] (PS6) Giuseppe Lavagetto: profile::mediawiki::php: tweak ini settings [puppet] - https://gerrit.wikimedia.org/r/502986 (https://phabricator.wikimedia.org/T211488)
[07:50:13] <_joe_> !log restarting php-fpm on mw1312, mw1261 to test the new settings over the weekend
[07:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:01] !log depool thumbor1001, switch back to nginx - T187765
[07:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:06] T187765: Replace the Nginx fronting Thumbor with a reverse proxy capable of queuing requests - https://phabricator.wikimedia.org/T187765
[07:56:59] RECOVERY - Check systemd state on thumbor1001 is OK: OK - running: The system is fully operational
[07:58:00] !log Pool thumbor1001
[07:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:33] RECOVERY - puppet last run on thumbor1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:02:47] Wikidata main page throwing 404 with "No input file specified." https://www.wikidata.org/wiki/Wikidata:Main_Page
[08:06:23] <_joe_> sjoerddebruin: wmf
[08:06:50] <_joe_> sjoerddebruin: do you have more info, like response headers, that you can paste somewhere?
[08:07:18] <_joe_> ok I just reproduced it under php7
[08:08:33] <_joe_> sjoerddebruin: are you using php7 as a beta feature?
[08:08:44] Yes, and on mw1261 according to the response header
[08:08:51] <_joe_> ok good
[08:09:01] <_joe_> it's one of the servers where I restarted php earlier
[08:09:07] <_joe_> per SAL
[08:09:19] <_joe_> now lemme find out what was wrong in the ini changes we made
[08:09:27] <_joe_> thanks for reporting it
[08:11:59] no problem :)
[08:12:02] <_joe_> lemme depool it for now
[08:12:33] <_joe_> !log depooling mw1261,mw1312 wikidata (at least) not working
[08:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:02] visiting articles gives Input File Not found and now the home page is this for me. Servers all okay? https://usercontent.irccloud-cdn.com/file/K6rOWYWB/wikierrors.PNG
[08:13:16] <_joe_> sjoerddebruin: still seeing that version of the page?
[08:13:33] _joe_: yes
[08:13:34] "No input file specified."*
[08:13:49] <_joe_> TheSandDoctor: what page? it was a single server causing it
[08:13:56] https://en.wikipedia.org/wiki/Tinderbox
[08:14:17] <_joe_> uhm this I can't reproduce now
[08:14:21] @_joe_ my twinkle options file has also been failing for the past while
[08:14:24] <_joe_> with php7
[08:14:33] other wikidata pages work "fine", error also on https://www.wikidata.org/w/load.php?lang=nl&modules=site.styles&only=styles&skin=vector
[08:14:35] and https://www.wikidata.org/w/load.php?lang=nl&modules=ext.gadget.InterProjectLinks&only=styles&skin=vector
[08:14:36] i have php7 on yes
[08:14:36] <_joe_> TheSandDoctor: still seeing that error?
[08:14:39] yes
[08:14:52] <_joe_> because I'm not :/
[08:15:02] <_joe_> so if you need to unblock yourself right now
[08:15:09] <_joe_> just switch back to hhvm, I'm sorry
[08:15:13] https://usercontent.irccloud-cdn.com/file/3a3qQLMH/wikierror2.PNG
[08:15:17] <_joe_> I will try to ban those pages
[08:15:20] <_joe_> from cache
[08:15:44] <_joe_> TheSandDoctor: heh I do see it, and I have php7, but I guess we end up in different caching sites
[08:15:46] bypassing my local cache produces the same error as the last screenshot @_joe_
[08:16:05] <_joe_> not your local cache, the edge cache
[08:16:12] <_joe_> which we keep separated between php7 and hhvm
[08:16:23] <_joe_> so the error is only under php7, and only on certain pages
[08:16:34] for me enwiki main page lost its css
[08:16:43] or at least a good portion of it
[08:17:01] <_joe_> TheSandDoctor: can you switch back the beta feature please and tell me if you still have the same problem?
[08:17:24] I've fixed the problem by removing the PHP_ENGINE session cookie
[08:17:38] And it's back on reload...
[08:17:41] <_joe_> you need to untick the beta feature
[08:17:44] <_joe_> it will readd it back
[08:18:10] @_joe_ unticking the beta feature fixed it
[08:18:13] beta server issue?
[08:18:21] <_joe_> TheSandDoctor: my fault actually
[08:18:31] <_joe_> I tried to change config on one server, and it created issues
[08:18:40] <_joe_> I'm now trying to figure out what caused it
[08:18:46] <_joe_> but it was one server out of 100
[08:18:51] huh
[08:19:06] <_joe_> and it was returning responses like all was ok, just with the wrong content
[08:19:13] weird
[08:19:30] <_joe_> sjoerddebruin: the wikidata main page now loads for me
[08:19:38] <_joe_> with php7
[08:19:57] thanks for helping make wikipedia accessible again for me though @__joe__. Hopefully you are able to resolve it all :)
[08:20:06] I'll try enabling it again in the morning
[08:20:10] * TheSandDoctor sleeps
[08:20:14] <_joe_> TheSandDoctor: well I caused your problems in the first place :/
[08:20:36] _joe_: no change here on php7
[08:20:43] <_joe_> the problems should solve themselves in a few minutes though
[08:20:54] <_joe_> sjoerddebruin: what's your x-cache header, if I may?
[08:21:07] x-cache: cp1085 pass, cp3041 hit/3, cp3040 hit/58
[08:21:23] <_joe_> oh right one specific frontend cached it
[08:23:13] _joe_: now it works
[08:23:22] <_joe_> I just purged it
[08:23:26] <_joe_> from all caches
[08:23:47] <_joe_> but it would've gone away itself soon
[08:24:56] i still have a few modules refusing to load, but no need to edit anytime soon...
[08:28:19] <[1997kB]> so is that why Template:Navbox looks like this https://en.wikipedia.org/wiki/User:1997kB/tools ?
[08:33:36] <_joe_> [1997kB]: I'm not sure what's wrong there
[08:33:45] <_joe_> I do see the page correctly, and I'm on the php7 beta
[08:34:26] <_joe_> [1997kB]: still having the issue?
[08:35:12] <[1997kB]> well some magic fixed it.
[08:35:13] (PS1) Giuseppe Lavagetto: profile::mediawiki::php: remove doc_root setting [puppet] - https://gerrit.wikimedia.org/r/505383
[08:36:22] <_joe_> I really didn't expect changing the value on one server would cause such havoc :/
[08:36:33] (CR) Giuseppe Lavagetto: [C: +2] profile::mediawiki::php: remove doc_root setting [puppet] - https://gerrit.wikimedia.org/r/505383 (owner: Giuseppe Lavagetto)
[08:37:57] <_joe_> I guess this is the rightful punishment for trying this on a weekend just because I couldn't finish yesterday
[08:49:33] @__joe__ Re-enabled php7 beta feature and now everything works
[08:49:35] Operations, PHP 7.2 support, Patch-For-Review, Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (Joe) FWIW, the `doc_root` being set was causing severe issues under php7.2. I removed it from the list of ini settings...
[08:49:43] thanks for resolving it @_joe_
[08:49:43] <_joe_> TheSandDoctor: <3
[08:50:05] <_joe_> no thank you for saving me from spoiling my easter because I want to finish something :)
[08:50:10] * TheSandDoctor just thought mediawiki was out to get him this evening when twinkle and purge stopped working :P
[08:54:42] PROBLEM - Apache HTTP on mw1317 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[08:56:00] RECOVERY - Apache HTTP on mw1317 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:28:37] Operations, PHP 7.2 support, Wikimedia-production-error: Investigate why a string literal changed in opcache (Fatal exception of type "ConfigException") - https://phabricator.wikimedia.org/T221347 (Daimona) Yes, that's correct. However I'd encourage to investigate this issue now: you know, logs get g...
[09:37:31] Operations, PHP 7.2 support, Wikimedia-production-error: Investigate why a string literal changed in opcache (Fatal exception of type "ConfigException") - https://phabricator.wikimedia.org/T221347 (Joe) I realized I failed to update this ticket with my investigation: looking at the metrics we collec...
[09:41:15] Operations, DBA, MediaWiki-Database, MediaWiki-Logging, Wikimedia-production-error: Special:Log on commons -- entire web request took longer than 60 seconds and timed out - https://phabricator.wikimedia.org/T221458 (Framawiki) p:High→Unbreak!
[10:20:34] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 109 probes of 410 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[10:21:08] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 152 probes of 449 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[10:31:12] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 18 probes of 410 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[10:31:42] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 5 probes of 449 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[11:01:52] Operations, PHP 7.2 support, Wikimedia-production-error: Investigate why a string literal changed in opcache (Fatal exception of type "ConfigException") - https://phabricator.wikimedia.org/T221347 (Urbanecm) >>! In T221347#5126757, @Joe wrote: > Btw, logs don't tell you much in the case you meet such...
[11:55:12] (CR) Volans: "Given that we don't support anymore python2 I guess we could drop mock as 3rd party module and replace its usage with the stdlib version (" [software/conftool] - https://gerrit.wikimedia.org/r/504980 (owner: CDanis)
[12:45:48] PROBLEM - HHVM rendering on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[12:46:58] RECOVERY - HHVM rendering on mw1285 is OK: HTTP OK: HTTP/1.1 200 OK - 74272 bytes in 0.803 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:52:04] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:18:32] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[14:26:32] (PS7) Andrew Bogott: cloud dns: move primary services to cloud-ns0 and cloud-ns1 [puppet] - https://gerrit.wikimedia.org/r/504572 (https://phabricator.wikimedia.org/T221183)
[14:44:32] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 145 probes of 446 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[14:48:10] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 110 probes of 407 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[14:48:33] Operations, Patch-For-Review, cloud-services-team (Kanban): Track remaining trusty servers in production - https://phabricator.wikimedia.org/T212772 (Andrew)
[14:49:22] Operations, Patch-For-Review, cloud-services-team (Kanban): Track remaining trusty servers in production - https://phabricator.wikimedia.org/T212772 (Andrew)
[14:55:06] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 10 probes of 446 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[14:58:18] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:58:46] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 15 probes of 407 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[15:07:55] (CR) Andrew Bogott: "I'm happy with the compiler results for this... will merge when I have time to babysit." [puppet] - https://gerrit.wikimedia.org/r/504572 (https://phabricator.wikimedia.org/T221183) (owner: Andrew Bogott)
[15:08:13] (CR) Andrew Bogott: "https://puppet-compiler.wmflabs.org/compiler1002/15926/" [puppet] - https://gerrit.wikimedia.org/r/504572 (https://phabricator.wikimedia.org/T221183) (owner: Andrew Bogott)
[15:12:55] Operations, Traffic, Patch-For-Review, Performance-Team (Radar): Refactor public-facing DYNA scheme for primary project hostnames in our DNS - https://phabricator.wikimedia.org/T208263 (BBlack) Status update on the experiments above: * No known reports or evidence of resolution failures so far,...
[15:29:04] PROBLEM - puppet last run on ms-be1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:30:06] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[15:30:38] PROBLEM - puppet last run on elastic1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:44:42] (PS1) Alex Monk: mariadb: Replace role::mariadb::ferm with profile::mariadb::ferm [puppet] - https://gerrit.wikimedia.org/r/505406
[15:49:42] Operations, DBA, MediaWiki-Database, MediaWiki-Logging, Wikimedia-production-error: Special:Log on commons -- entire web request took longer than 60 seconds and timed out - https://phabricator.wikimedia.org/T221458 (Marostegui) This is the query I believe - looks like the optimizer is being d...
[15:53:34] (PS1) Alex Monk: network::constants: Move mysql_root_clients from special_hosts to hieradata [puppet] - https://gerrit.wikimedia.org/r/505407 (https://phabricator.wikimedia.org/T220894)
[15:54:10] Operations, Patch-For-Review: Replacement of network::constant's special_hosts - https://phabricator.wikimedia.org/T220894 (Krenair)
[15:54:31] (CR) jerkins-bot: [V: -1] network::constants: Move mysql_root_clients from special_hosts to hieradata [puppet] - https://gerrit.wikimedia.org/r/505407 (https://phabricator.wikimedia.org/T220894) (owner: Alex Monk)
[15:55:34] RECOVERY - puppet last run on ms-be1023 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[15:56:01] (PS2) Alex Monk: network::constants: Move mysql_root_clients from special_hosts to hieradata [puppet] - https://gerrit.wikimedia.org/r/505407 (https://phabricator.wikimedia.org/T220894)
[15:56:48] (CR) jerkins-bot: [V: -1] network::constants: Move mysql_root_clients from special_hosts to hieradata [puppet] - https://gerrit.wikimedia.org/r/505407 (https://phabricator.wikimedia.org/T220894) (owner: Alex Monk)
[15:57:04] RECOVERY - puppet last run on elastic1051 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[15:59:04] (PS3) Alex Monk: network::constants: Move mysql_root_clients from special_hosts to hieradata [puppet] - https://gerrit.wikimedia.org/r/505407 (https://phabricator.wikimedia.org/T220894)
[16:00:29] (CR) Alex Monk: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/505407 (https://phabricator.wikimedia.org/T220894) (owner: Alex Monk)
[16:01:36] (PS9) Alex Monk: network::constants: Move puppet_frontends to using existing data instead [puppet] - https://gerrit.wikimedia.org/r/505356 (https://phabricator.wikimedia.org/T220894)
[16:17:34] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:25:30] (PS1) Alex Monk: gerrit::proxy: Switch to strong SSL settings [puppet] - https://gerrit.wikimedia.org/r/505410 (https://phabricator.wikimedia.org/T221499)
[16:26:06] (CR) Alex Monk: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/505410 (https://phabricator.wikimedia.org/T221499) (owner: Alex Monk)
[16:26:37] (CR) Paladox: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/505410 (https://phabricator.wikimedia.org/T221499) (owner: Alex Monk)
[16:28:18] (CR) Alex Monk: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/505373 (https://phabricator.wikimedia.org/T220894) (owner: Alex Monk)
[16:32:13] Urgh, probably should've added a Hosts: line to that gerrit change before running PCC. Oh well.
[16:38:51] (CR) Alex Monk: "Unknown resource type: '::profile::mariadb::ferm' at /srv/jenkins-workspace/puppet-compiler/159/change/src/modules/role/manifests/mariadb/" [puppet] - https://gerrit.wikimedia.org/r/505407 (https://phabricator.wikimedia.org/T220894) (owner: Alex Monk)
[16:44:04] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:50:39] Operations, Core Platform Team, DBA, MediaWiki-Database, and 2 others: Special:Log on commons -- entire web request took longer than 60 seconds and timed out - https://phabricator.wikimedia.org/T221458 (Marostegui)
[17:04:02] Operations, serviceops, Patch-For-Review: setup/install WMF7426 as phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T221389 (mmodell) fwiw my irc client highlights both names but I rarely log in to the @20after4 phab account.
[17:08:53] (CR) Alex Monk: "Based on https://puppet-compiler.wmflabs.org/compiler1002/160/cobalt.wikimedia.org/ this removes support for the following cipher suites t" [puppet] - https://gerrit.wikimedia.org/r/505410 (https://phabricator.wikimedia.org/T221499) (owner: Alex Monk)
[17:34:14] Operations, Core Platform Team, DBA, MediaWiki-Database, and 3 others: Special:Log on commons -- entire web request took longer than 60 seconds and timed out - https://phabricator.wikimedia.org/T221458 (Aklapper)
[17:41:04] PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /srv 28458 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[17:43:40] RECOVERY - Disk space on elastic1017 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[17:55:05] Operations, ops-eqiad, DBA: db1099 memory issues - https://phabricator.wikimedia.org/T221502 (Marostegui)
[17:55:16] Operations, ops-eqiad, DBA: db1099 memory issues - https://phabricator.wikimedia.org/T221502 (Marostegui) p:Triage→Normal
[18:20:38] Operations, Core Platform Team, DBA, MediaWiki-Database, and 3 others: Special:Log on commons -- entire web request took longer than 60 seconds and timed out - https://phabricator.wikimedia.org/T221458 (Jc86035) @Framawiki It //sounds// like the same issue but I'm not totally sure.
[19:27:15] Operations, ops-eqiad, netops: Replace eqiad mgmt switches with EX4200s - https://phabricator.wikimedia.org/T213128 (faidon) a:Cmjohnson→ayounsi I've surfaced the idea myself in the past, but the more I think about it the more I think it's not such a great idea at this point... - EX4200s wer...
[20:05:38] (CR) Krinkle: [C: +1] mwgrep: Also find Gadgets-definition message [puppet] - https://gerrit.wikimedia.org/r/504991 (owner: Jforrester)
[20:46:54] Operations, Parsoid, RESTBase, VisualEditor, and 5 others: Consider stashing data-parsoid for VE - https://phabricator.wikimedia.org/T215956 (mobrovac) Since in either case we don't need to have special Varnish rules, I think we need both the query parameter and UA matching in order to ease the t...
[20:49:26] Operations, ops-eqsin: update PDUs for eqsin (asset tag and other info) - https://phabricator.wikimedia.org/T211368 (faidon) >>! In T211368#4805848, @faidon wrote: > Can we add procurement task and purchase date immediately? It doesn't sound like there is an immediate blocker to this. ^ this is now done...
[20:53:27] Operations, netops: Juniper network device audit - all sites - https://phabricator.wikimedia.org/T213843 (faidon) I was looking at FY19-20 CapEx planning and ran an export of the Entitlement Report from Juniper's website. The output is... not very close to the truth. There are serial there that do not ma...
[20:53:39] Operations, DC-Ops, netops: Juniper network device audit - all sites - https://phabricator.wikimedia.org/T213843 (faidon)
[20:53:55] Operations, DC-Ops, netops: Juniper network device audit - all sites - https://phabricator.wikimedia.org/T213843 (faidon)
[20:54:39] Operations, DC-Ops, netops: Juniper network device audit - all sites - https://phabricator.wikimedia.org/T213843 (faidon)
[21:00:15] Operations, DC-Ops, netops: Juniper network device audit - all sites - https://phabricator.wikimedia.org/T213843 (ayounsi) yep! Juniper has been working on it since a few weeks in ticket: 2019-0408-0694 Based on https://docs.google.com/spreadsheets/d/1tJ-mqN4-g_NyvO24pRERxVTbX6AMe6lMG9YcO2840Vg/edit...
[21:18:06] Operations, DC-Ops, netops: Inventorize network equipment in Netbox - https://phabricator.wikimedia.org/T221506 (faidon) p:Triage→Normal
[21:29:46] Operations, netops: Netbox report to validate network equipment data - https://phabricator.wikimedia.org/T221507 (faidon)
[21:43:52] PROBLEM - puppet last run on planet1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:10:22] RECOVERY - puppet last run on planet1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:11:56] PROBLEM - Disk space on webperf2001 is CRITICAL: DISK CRITICAL - free space: / 1556 MB (3% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[22:28:06] PROBLEM - Check systemd state on webperf2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[22:28:22] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:34:27] Operations, Performance-Team: webperf2001 is running ouf of disk space - https://phabricator.wikimedia.org/T221508 (jijiki)
[22:34:49] Operations, Performance-Team: webperf2001 is running ouf of disk space - https://phabricator.wikimedia.org/T221508 (jijiki) p:Triage→High
[22:37:48] Krinkle: if you are around, do you know why webperf2001 is full?
[22:42:35] Operations, Performance-Team: webperf2001 is running ouf of disk space - https://phabricator.wikimedia.org/T221508 (jijiki)
[22:45:57] I am on a really slow connection, I am afraid I can't look into it more
[22:46:55] Operations, Performance-Team: webperf2001 is running ouf of disk space - https://phabricator.wikimedia.org/T221508 (Peachey88)
[22:54:52] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:49:27] (PS15) Paladox: Gerrit: Add flogger javaopts [puppet] - https://gerrit.wikimedia.org/r/463519 (https://phabricator.wikimedia.org/T200739)