[00:16:07] RECOVERY - puppet last run on lvs5003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[00:58:05] PROBLEM - Host wtp2019 is DOWN: PING CRITICAL - Packet loss = 100%
[01:41:49] (CR) Andrew Bogott: "Yes! I suspected that this was a failover from an ipv6 failure but was looking in the completely wrong place. I'm going to give this a c" [puppet] - https://gerrit.wikimedia.org/r/505477 (https://phabricator.wikimedia.org/T221527) (owner: Alex Monk)
[01:47:28] (CR) Andrew Bogott: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/15927/" [puppet] - https://gerrit.wikimedia.org/r/505477 (https://phabricator.wikimedia.org/T221527) (owner: Alex Monk)
[01:47:36] (PS4) Andrew Bogott: auth pdns: bind on IPv6 [puppet] - https://gerrit.wikimedia.org/r/505477 (https://phabricator.wikimedia.org/T221527) (owner: Alex Monk)
[02:01:49] RECOVERY - Check for gridmaster host resolution UDP on cloudservices1003 is OK: DNS OK - 0.013 seconds response time (tools-sgegrid-master.tools.eqiad.wmflabs. 60 IN A 172.16.4.197) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:23:39] Operations, Traffic, cloud-services-team (Kanban): Update RIPE about changes in WMCS auth servers - https://phabricator.wikimedia.org/T221531 (Andrew)
[02:24:39] (PS1) Alex Monk: Remove old labs 'main' region in-addr.arpa delegation [dns] - https://gerrit.wikimedia.org/r/505478
[02:25:24] Operations, Traffic, cloud-services-team (Kanban): Update RIPE about changes in WMCS auth servers - https://phabricator.wikimedia.org/T221531 (Krenair)
[02:36:35] PROBLEM - puppet last run on restbase1016 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues
[03:01:18] (CR) KartikMistry: [C: +1] Use higher unmodified MT threshold for Indonesian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/505220 (https://phabricator.wikimedia.org/T221353) (owner: Petar.petkovic)
[03:03:11] RECOVERY - puppet last run on restbase1016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:45:19] PROBLEM - puppet last run on analytics1054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:54:34] (CR) Thcipriani: [C: +1] "If zuul seems like it'll be happy with this, I'm happy with this." [puppet] - https://gerrit.wikimedia.org/r/504973 (https://phabricator.wikimedia.org/T182756) (owner: Hashar)
[04:11:47] RECOVERY - puppet last run on analytics1054 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[05:16:49] PROBLEM - puppet last run on webperf2001 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago
[05:21:10] (PS1) Marostegui: db-eqiad.php: Depool db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/505481 (https://phabricator.wikimedia.org/T221502)
[05:22:32] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/505481 (https://phabricator.wikimedia.org/T221502) (owner: Marostegui)
[05:23:35] (Merged) jenkins-bot: db-eqiad.php: Depool db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/505481 (https://phabricator.wikimedia.org/T221502) (owner: Marostegui)
[05:25:52] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1099 T221502 (duration: 01m 15s)
[05:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:58] T221502: db1099 memory issues - https://phabricator.wikimedia.org/T221502
[05:25:59] !log Stop MySQL and reboot db1099 to see if memory errors clear up T221502
[05:26:03] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:57] (CR) jenkins-bot: db-eqiad.php: Depool db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/505481 (https://phabricator.wikimedia.org/T221502) (owner: Marostegui)
[05:33:50] Operations, ops-eqiad, DBA, Patch-For-Review: db1099 memory issues - https://phabricator.wikimedia.org/T221502 (Marostegui) I rebooted the host to see if the memory errors would clear up, but it didn't happen, so I guess we have to either contact Dell or move the DIMM to a different slot and wait...
[05:34:40] (PS1) Marostegui: db-eqiad.php: Slowly repool db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/505482
[05:36:07] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/505482 (owner: Marostegui)
[05:37:04] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/505482 (owner: Marostegui)
[05:38:18] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1099 (duration: 00m 54s)
[05:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:39:19] (PS1) Marostegui: db-eqiad.php: More traffic to db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/505483
[05:43:15] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/505482 (owner: Marostegui)
[05:53:44] Operations, ops-codfw, DBA, Goal: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (Marostegui)
[05:53:54] Operations, ops-codfw, DBA, Goal: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (Marostegui) p:Triage→Normal
[05:54:12] (CR) Marostegui: [C: +2] db-eqiad.php: More traffic to db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/505483 (owner: 
Marostegui)
[05:55:11] (Merged) jenkins-bot: db-eqiad.php: More traffic to db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/505483 (owner: Marostegui)
[05:55:24] (CR) jenkins-bot: db-eqiad.php: More traffic to db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/505483 (owner: Marostegui)
[05:56:21] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1099 (duration: 00m 53s)
[05:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:40] (PS1) Marostegui: db-eqiad.php: More traffic to db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/505484
[06:11:04] (CR) ArielGlenn: "Looks ok but once again this shouldn't be merged until phab1003 is configured for rsync and actually producing dump files to be picked up." [puppet] - https://gerrit.wikimedia.org/r/437558 (https://phabricator.wikimedia.org/T196019) (owner: Dzahn)
[06:22:09] Operations, Performance-Team: webperf2001 is running out of disk space - https://phabricator.wikimedia.org/T221508 (Gilles) The multiple statsv instances probably aren't due to the constant restarting, because the same is observed on webperf1001 with instances that were started a long time ago: ` nobody... 
[06:25:50] (PS1) Giuseppe Lavagetto: Add Language::ucfirst overrides for php 7.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/505487 (https://phabricator.wikimedia.org/T219279)
[06:26:54] (CR) jerkins-bot: [V: -1] Add Language::ucfirst overrides for php 7.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/505487 (https://phabricator.wikimedia.org/T219279) (owner: Giuseppe Lavagetto)
[06:27:50] Operations, MediaWiki-General-or-Unknown, serviceops, Core Platform Team (PHP7 (TEC4)), and 4 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (Joe) @kchapman regarding point 1 above - I've prepared various...
[06:28:35] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:29:07] Operations, Performance-Team: webperf2001 is running out of disk space - https://phabricator.wikimedia.org/T221508 (Gilles) The answer is that statsv spawns workers as separate processes. By default the amount of workers if half the amount of logical CPUs, which works out to 2 workers on those machines (... 
[06:35:31] (PS2) Giuseppe Lavagetto: Add Language::ucfirst overrides for php 7.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/505487 (https://phabricator.wikimedia.org/T219279)
[06:36:51] (CR) Marostegui: [C: +2] db-eqiad.php: More traffic to db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/505484 (owner: Marostegui)
[06:37:47] RECOVERY - Disk space on webperf2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[06:37:49] (Merged) jenkins-bot: db-eqiad.php: More traffic to db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/505484 (owner: Marostegui)
[06:38:50] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1099 (duration: 00m 53s)
[06:39:08] (CR) jenkins-bot: db-eqiad.php: More traffic to db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/505484 (owner: Marostegui)
[06:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:43] Operations, ops-codfw, DBA, Goal: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (Marostegui) Keep in mind that db2033 can be decommissioned (it is on C6) T220070
[06:40:53] !log Upgrade dbstore1003
[06:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:55] RECOVERY - puppet last run on webperf2001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[06:50:02] Operations, PHP 7.2 support, Wikimedia-production-error: Investigate why a string literal changed in opcache (Fatal exception of type "ConfigException") - https://phabricator.wikimedia.org/T221347 (Joe) >>! In T221347#5126824, @Urbanecm wrote: >>>! In T221347#5126757, @Joe wrote: >> Btw, logs don't t... 
[06:53:35] Operations, ops-codfw, decommission: Decommission db2033 - https://phabricator.wikimedia.org/T220070 (Marostegui)
[06:55:01] RECOVERY - Check systemd state on webperf2001 is OK: OK - running: The system is fully operational
[07:00:05] Deploy window No deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190422T0700)
[07:09:13] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational
[07:12:27] (PS1) Marostegui: db-eqiad.php: More traffic to db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/505496
[07:14:11] (CR) ArielGlenn: [C: +1] "Looks like it removes all related cruft, as advertised." [puppet] - https://gerrit.wikimedia.org/r/505304 (https://phabricator.wikimedia.org/T216985) (owner: Dzahn)
[07:14:49] (CR) Marostegui: [C: +2] db-eqiad.php: More traffic to db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/505496 (owner: Marostegui)
[07:15:51] (Merged) jenkins-bot: db-eqiad.php: More traffic to db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/505496 (owner: Marostegui)
[07:16:54] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1099 (duration: 00m 53s)
[07:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:34] (PS1) Marostegui: db-eqiad.php: Increase traffic for db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/505497
[07:18:42] (CR) Alexandros Kosiaris: [C: +1] icinga: remove google safe browsing monitoring [puppet] - https://gerrit.wikimedia.org/r/505304 (https://phabricator.wikimedia.org/T216985) (owner: Dzahn)
[07:19:38] (CR) Alexandros Kosiaris: [C: +1] "+1 (and yes puppet-merge, sync-git-upstream work over HTTPS so they won't be affected), but won't merge this on Easter week." 
[puppet] - https://gerrit.wikimedia.org/r/504973 (https://phabricator.wikimedia.org/T182756) (owner: Hashar)
[07:21:47] (CR) ArielGlenn: "> Also the reversing thing might work to break MediaWiki's password" [puppet] - https://gerrit.wikimedia.org/r/498429 (owner: Dzahn)
[07:26:43] (CR) jenkins-bot: db-eqiad.php: More traffic to db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/505496 (owner: Marostegui)
[07:33:27] (CR) Marostegui: [C: +2] db-eqiad.php: Increase traffic for db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/505497 (owner: Marostegui)
[07:36:11] (Merged) jenkins-bot: db-eqiad.php: Increase traffic for db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/505497 (owner: Marostegui)
[07:37:19] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1099 (duration: 00m 53s)
[07:37:20] (PS1) Marostegui: db-eqiad.php: Fully repool db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/505618
[07:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:28] (CR) jenkins-bot: db-eqiad.php: Increase traffic for db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/505497 (owner: Marostegui)
[07:42:27] Operations, Performance-Team: webperf2001 is running out of disk space - https://phabricator.wikimedia.org/T221508 (Gilles) Open→Resolved a:Gilles Restarting coal fixed it. I think it was still the consequence of the kafka maintenance last week, that left coal in a bad state. The restarted co... 
[07:49:42] (CR) Marostegui: [C: +2] db-eqiad.php: Fully repool db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/505618 (owner: Marostegui)
[07:51:17] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/505618 (owner: Marostegui)
[07:53:31] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1099 (duration: 00m 54s)
[07:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:00] (CR) jenkins-bot: db-eqiad.php: Fully repool db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/505618 (owner: Marostegui)
[08:09:27] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:12:20] (CR) Marostegui: [C: +1] update mariadb grants from phab1002 to phab1003 (comments only) [puppet] - https://gerrit.wikimedia.org/r/496120 (owner: Dzahn)
[08:13:21] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:16:07] (CR) Marostegui: "I am not sure I understand the update to dsa-check-ssacli, is that script supposed to use ssacli now instead of hpssacli?" 
[puppet] - https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) (owner: Jbond)
[08:17:54] (CR) Marostegui: [C: +1] "Note that I haven't checked the MAC addresses, just the syntax" [puppet] - https://gerrit.wikimedia.org/r/504562 (https://phabricator.wikimedia.org/T219399) (owner: Jcrespo)
[08:37:15] (PS3) Alexandros Kosiaris: network::constants: Remove seemingly unused druid_analytics_hosts [puppet] - https://gerrit.wikimedia.org/r/505366 (https://phabricator.wikimedia.org/T220894) (owner: Alex Monk)
[08:37:30] !log Upgrade dbstore1005
[08:37:31] (CR) Alexandros Kosiaris: [C: +2] "Double checked as well, this indeed does not exist anywhere, merging. Thanks" [puppet] - https://gerrit.wikimedia.org/r/505366 (https://phabricator.wikimedia.org/T220894) (owner: Alex Monk)
[08:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:46] (CR) Alexandros Kosiaris: [C: +2] "PCC at https://puppet-compiler.wmflabs.org/compiler1002/15928/ is happy and everything looks as expected. Merging, thanks!" [puppet] - https://gerrit.wikimedia.org/r/505373 (https://phabricator.wikimedia.org/T220894) (owner: Alex Monk)
[08:39:53] (PS2) Alexandros Kosiaris: network::constants: Move various analytics special_hosts to hiera [puppet] - https://gerrit.wikimedia.org/r/505373 (https://phabricator.wikimedia.org/T220894) (owner: Alex Monk)
[08:44:07] (PS1) ArielGlenn: generate index.html file for ncr dumps once per pass over all wikis [dumps] - https://gerrit.wikimedia.org/r/505623 (https://phabricator.wikimedia.org/T221515)
[08:47:35] !Log finished maintenance window on dbstore1003 and dbstore1005
[08:47:55] !log finished maintenance window on dbstore1003 and dbstore1005
[08:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:20] (CR) Alexandros Kosiaris: [C: -1] "Sigh, if only it was that easy. 
profile::mariadb::ferm is (wrongly ofc) a definition, not a class so it can't be included. We need to fix " [puppet] - https://gerrit.wikimedia.org/r/505406 (owner: Alex Monk)
[09:03:19] PROBLEM - HHVM rendering on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:04:27] RECOVERY - HHVM rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 78242 bytes in 0.136 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:05:08] (CR) Arturo Borrero Gonzalez: "The change is a bit more complex than this." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/505339 (https://phabricator.wikimedia.org/T209527) (owner: Bstorm)
[09:27:35] (CR) Alexandros Kosiaris: "> Thanks for setting this up! We currently do not have a swagger spec set up. There were earlier concerns about access and reserving a key" (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/505263 (https://phabricator.wikimedia.org/T220401) (owner: Alexandros Kosiaris)
[10:19:00] (CR) Alexandros Kosiaris: [C: +2] "> We could probably work around this and resolve the IPs in the puppet manifests if necessary." 
[puppet] - https://gerrit.wikimedia.org/r/505356 (https://phabricator.wikimedia.org/T220894) (owner: Alex Monk)
[10:22:21] (PS2) Arturo Borrero Gonzalez: ldap: Add support for sudo rules in sssd client config [puppet] - https://gerrit.wikimedia.org/r/504817 (https://phabricator.wikimedia.org/T221225) (owner: BryanDavis)
[10:23:26] (CR) Arturo Borrero Gonzalez: [C: +2] ldap: Add support for sudo rules in sssd client config [puppet] - https://gerrit.wikimedia.org/r/504817 (https://phabricator.wikimedia.org/T221225) (owner: BryanDavis)
[10:24:13] (PS10) Alexandros Kosiaris: network::constants: Move puppet_frontends to using existing data instead [puppet] - https://gerrit.wikimedia.org/r/505356 (https://phabricator.wikimedia.org/T220894) (owner: Alex Monk)
[10:25:12] (CR) Alexandros Kosiaris: [C: +2] "> Also rsync is supposed to do the forward and reverse DNS lookups so IPv4+IPv6 should work (but will anyway test)" [puppet] - https://gerrit.wikimedia.org/r/505356 (https://phabricator.wikimedia.org/T220894) (owner: Alex Monk)
[10:59:32] Operations, ops-codfw, decommission, cloud-services-team (Kanban): decommission: labtestservices2001.wikimedia.org - https://phabricator.wikimedia.org/T218022 (aborrero)
[11:29:49] (PS1) Arturo Borrero Gonzalez: labtestservices2001: use spare role [puppet] - https://gerrit.wikimedia.org/r/505629 (https://phabricator.wikimedia.org/T218022)
[11:31:09] Operations, ops-codfw, decommission, Patch-For-Review, cloud-services-team (Kanban): decommission: labtestservices2001.wikimedia.org - https://phabricator.wikimedia.org/T218022 (aborrero)
[11:31:35] (CR) Arturo Borrero Gonzalez: [C: +2] labtestservices2001: use spare role [puppet] - https://gerrit.wikimedia.org/r/505629 (https://phabricator.wikimedia.org/T218022) (owner: Arturo Borrero Gonzalez)
[11:35:45] Operations, ops-codfw, decommission, Patch-For-Review: decommission: labtestservices2001.wikimedia.org 
- https://phabricator.wikimedia.org/T218022 (aborrero) Stalled→Open p:Triage→Normal a:aborrero→RobH
[11:49:36] Operations, Cloud-Services, Maps (Maps-data): Adding tags hstore GIN indexes to the OSM database on osmdb.eqiad.wmnet for performance - https://phabricator.wikimedia.org/T221541 (edwardbetts)
[12:40:27] PROBLEM - puppet last run on kafka-jumbo1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:53:16] Operations, Parsoid, serviceops, Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (ssastry)
[12:53:19] Operations, ops-eqiad, DC-Ops, Parsoid, decommission: decom ruthenium - https://phabricator.wikimedia.org/T216062 (ssastry) Open→Resolved
[12:58:05] Operations, Patch-For-Review: Replacement of network::constant's special_hosts - https://phabricator.wikimedia.org/T220894 (Krenair)
[13:00:56] (CR) Alex Monk: "Yeah. I tried changing that in the child commit but it didn't work out, there's puppet errors in there." 
[puppet] - https://gerrit.wikimedia.org/r/505406 (owner: Alex Monk)
[13:06:57] RECOVERY - puppet last run on kafka-jumbo1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[13:25:13] (PS2) Alex Monk: mariadb: Replace role::mariadb::ferm with profile::mariadb::ferm [puppet] - https://gerrit.wikimedia.org/r/505406
[13:26:03] (CR) jerkins-bot: [V: -1] mariadb: Replace role::mariadb::ferm with profile::mariadb::ferm [puppet] - https://gerrit.wikimedia.org/r/505406 (owner: Alex Monk)
[13:34:47] (PS3) Alex Monk: mariadb: Replace role::mariadb::ferm with profile::mariadb::ferm [puppet] - https://gerrit.wikimedia.org/r/505406
[13:39:52] (PS4) Alex Monk: mariadb: Replace role::mariadb::ferm with profile::mariadb::ferm [puppet] - https://gerrit.wikimedia.org/r/505406
[13:39:54] (PS4) Alex Monk: network::constants: Move mysql_root_clients from special_hosts to hieradata [puppet] - https://gerrit.wikimedia.org/r/505407 (https://phabricator.wikimedia.org/T220894)
[13:42:53] (CR) Alex Monk: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/505407 (https://phabricator.wikimedia.org/T220894) (owner: Alex Monk)
[14:12:19] Operations, Thumbor, serviceops, Patch-For-Review: Export useful metrics from haproxy logs for Thumbor - https://phabricator.wikimedia.org/T220499 (Gilles) I've added the relevant panels to your dashboard, mirroring the data we were tracking for nginx: {F28728717, size=full}
[15:08:39] (CR) Alex Monk: [C: -1] "ugh, right: Error: Evaluation Error: Error while evaluating a Resource Statement, Duplicate declaration: Class[Profile::Mariadb::Ferm] is " [puppet] - https://gerrit.wikimedia.org/r/505406 (owner: Alex Monk)
[15:40:43] (PS1) Faidon Liambotis: quotereviewer: support 2019-style Dell EMC quotes [software] - https://gerrit.wikimedia.org/r/505640
[15:41:39] (CR) jerkins-bot: [V: -1] quotereviewer: support 2019-style Dell EMC quotes [software] 
- https://gerrit.wikimedia.org/r/505640 (owner: Faidon Liambotis)
[15:47:40] (PS1) Arturo Borrero Gonzalez: openstack: pdns: auth: use mariadb 10.1 on stretch [puppet] - https://gerrit.wikimedia.org/r/505641 (https://phabricator.wikimedia.org/T221463)
[15:48:42] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: pdns: auth: use mariadb 10.1 on stretch [puppet] - https://gerrit.wikimedia.org/r/505641 (https://phabricator.wikimedia.org/T221463) (owner: Arturo Borrero Gonzalez)
[15:52:28] jouncebot: now
[15:52:28] For the next 15 hour(s) and 7 minute(s): No deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190422T0700)
[15:52:37] PROBLEM - HHVM rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[15:53:47] RECOVERY - HHVM rendering on mw1227 is OK: HTTP OK: HTTP/1.1 200 OK - 78749 bytes in 0.819 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[16:13:51] PROBLEM - puppet last run on ms-be1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:30:36] Operations, netops: VCF license dissapeared - https://phabricator.wikimedia.org/T221553 (ayounsi) Open→Resolved p:Triage→Normal
[16:40:17] RECOVERY - puppet last run on ms-be1048 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[16:57:01] (PS1) Bmansurov: Turn off logging for CitationUsage and CitationUsagePageLoad [mediawiki-config] - https://gerrit.wikimedia.org/r/505643 (https://phabricator.wikimedia.org/T213969)
[17:06:42] Operations, serviceops, vm-requests, Patch-For-Review, User-jijiki: Site: 4 VM request for kubernetes - https://phabricator.wikimedia.org/T220822 (ayounsi) One typo: codfw has `10.64.32.18` and `2620:0:861:103:10:64:32:18` Other than that it looks all good. Some questions: Can it be done any... 
[17:09:06] Operations, Thumbor, serviceops, Patch-For-Review: Export useful metrics from haproxy logs for Thumbor - https://phabricator.wikimedia.org/T220499 (jijiki) @Gilles thank you! I added the relevant codfw ones
[17:16:49] Operations, fundraising-tech-ops, netops: Network setup for frmon2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T221475 (ayounsi) Does this need a public IP and NAT? Is it fine to push it anytime or sync up with you?
[17:21:45] Operations, Thumbor, serviceops, Patch-For-Review: Export useful metrics from haproxy logs for Thumbor - https://phabricator.wikimedia.org/T220499 (Gilles) Open→Resolved
[17:40:38] Operations, netops: VCF license disappeared - https://phabricator.wikimedia.org/T221553 (Aklapper)
[17:53:45] PROBLEM - LVS HTTP IPv4 on zotero.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[17:54:57] RECOVERY - LVS HTTP IPv4 on zotero.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 138 bytes in 0.016 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[17:55:55] Operations, serviceops, vm-requests, Patch-For-Review, User-jijiki: Site: 4 VM request for kubernetes - https://phabricator.wikimedia.org/T220822 (akosiaris) >>! In T220822#5128635, @ayounsi wrote: > One typo: > codfw has `10.64.32.18` and `2620:0:861:103:10:64:32:18` Indeed it's `10.192.0.1...
[18:06:22] Operations, netops: Netbox report to validate network equipment data - https://phabricator.wikimedia.org/T221507 (ayounsi) LibreNMS's API is very limited (eg. can't access the inventory in bulk for all devices, it also doesn't play well with LDAP auth), but it does make sens to query LibreNMS (most likel... 
[18:15:10] !log Add k8s BGP neighbors on cr1/2-codfw - T220822
[18:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:15:16] T220822: Site: 4 VM request for kubernetes - https://phabricator.wikimedia.org/T220822
[18:22:24] !log Add k8s BGP neighbors on cr1/2-eqiad - T220822
[18:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:22:30] T220822: Site: 4 VM request for kubernetes - https://phabricator.wikimedia.org/T220822
[18:27:16] Operations, Release Pipeline, Release-Engineering-Team, serviceops, and 5 others: Add security sensitive nodes to our kubernetes cluster - https://phabricator.wikimedia.org/T220821 (ayounsi)
[18:27:19] Operations, serviceops, vm-requests, Patch-For-Review, User-jijiki: Site: 4 VM request for kubernetes - https://phabricator.wikimedia.org/T220822 (ayounsi) Open→Resolved Sessions added and established.
[18:41:45] Operations, Release Pipeline, Release-Engineering-Team, serviceops, and 5 others: Add security sensitive nodes to our kubernetes cluster - https://phabricator.wikimedia.org/T220821 (akosiaris) Open→Resolved kubernetes1005, kubernetes1006, kubernetes2005, kubernetes2006 added with specific... 
[18:41:51] Operations, Release Pipeline, Release-Engineering-Team, serviceops, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (akosiaris)
[18:46:52] !log gilles@deploy1001 Synchronized php-1.34.0-wmf.1/includes/media/ThumbnailImage.php: T216499 Only apply high priority hint half the time (duration: 00m 53s)
[18:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:57] T216499: Priority Hints origin trial - https://phabricator.wikimedia.org/T216499
[19:04:02] (PS1) CRusnov: Add a check_netbox_report icinga check [puppet] - https://gerrit.wikimedia.org/r/505656
[19:04:40] (CR) jerkins-bot: [V: -1] Add a check_netbox_report icinga check [puppet] - https://gerrit.wikimedia.org/r/505656 (owner: CRusnov)
[19:08:47] (PS1) CRusnov: Add a check_netbox_report icinga check [puppet] - https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378)
[19:09:25] (CR) jerkins-bot: [V: -1] Add a check_netbox_report icinga check [puppet] - https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) (owner: CRusnov)
[19:09:30] (Abandoned) CRusnov: Add a check_netbox_report icinga check [puppet] - https://gerrit.wikimedia.org/r/505656 (owner: CRusnov)
[19:10:26] Operations, Release Pipeline, Release-Engineering-Team, serviceops, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (akosiaris) I did some benchmarking and here's some first (rather impressive numbers) for kask This is with 750 simul... 
[19:11:30] (PS2) CRusnov: Add a check_netbox_report icinga check [puppet] - https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378)
[19:16:49] (PS3) CRusnov: Add a check_netbox_report icinga check [puppet] - https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378)
[19:18:20] (CR) CRusnov: "Tested on af-netbox01, and it seems to perform as expected." [puppet] - https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) (owner: CRusnov)
[19:27:27] Operations, Release Pipeline, Release-Engineering-Team, serviceops, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (akosiaris) Just for posterity's sake, at ~1500 artificially simulated users the service started to crumble and starte...
[19:45:49] Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (ayounsi)
[19:56:33] Operations, DC-Ops, netops: Inventorize network equipment in Netbox - https://phabricator.wikimedia.org/T221506 (ayounsi) I do think it's something useful to track. And overall quite easy for a one time import. Running the following against LibreNMS DB: `lang=sql SELECT sysName, entPhysicalName, ent...
[19:58:36] Operations, netops: Netbox report to validate network equipment data - https://phabricator.wikimedia.org/T221507 (faidon) All excellent points :) I especially like the PDU & scs suggestion! To be honest, I wouldn't focus on the inventory part yet. Let's just start with some sanity check for device statu...
[20:18:25] (CR) CRusnov: "Few comments/replies." 
(2 comments) [software/netbox-reports] - https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) (owner: CRusnov)
[20:21:39] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.641e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[20:25:31] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[20:25:46] Operations, Thumbor, serviceops: Build Thumbor packages for buster - https://phabricator.wikimedia.org/T221562 (jijiki) p:Triage→Normal
[20:26:22] Operations, Thumbor, serviceops, User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (jijiki)
[20:26:24] Operations, Thumbor, serviceops: Build Thumbor packages for buster - https://phabricator.wikimedia.org/T221562 (jijiki)
[20:29:33] (PS1) Aklapper: Phab: Allow greg and aklapper to convert projects to subprojects/milestones [puppet] - https://gerrit.wikimedia.org/r/505667
[20:32:35] (CR) Aklapper: "No idea if this change does allow passing the required parameters to the command. 
Someone please correct if it does not; see https://phabr" [puppet] - 10https://gerrit.wikimedia.org/r/505667 (owner: 10Aklapper) [20:38:37] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [20:38:41] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 170 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [21:19:03] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [21:26:51] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [21:28:05] 10Operations, 10DC-Ops, 10netops: Inventorize network equipment in Netbox - https://phabricator.wikimedia.org/T221506 (10faidon) Apparently Netbox allows for a [[ https://netbox.wikimedia.org/dcim/inventory-items/import/ | CSV import ]] even for inventory items. So, `ssh cr1-eqiad.wikimedia.org show chassi... [21:43:57] (03CR) 10C. Scott Ananian: [C: 03+1] "joe: can we do this now?" [puppet] - 10https://gerrit.wikimedia.org/r/368248 (https://phabricator.wikimedia.org/T117845) (owner: 10Fomafix) [21:49:37] (03CR) 10Ayounsi: "I'm wondering how resource intensive is generating a report, and if there is a risk of DOSing Netbox with Icinga." 
[puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) (owner: 10CRusnov) [21:54:21] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 46.15% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [22:02:29] PROBLEM - puppet last run on scandium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:04:47] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [22:11:18] (03CR) 10CRusnov: "Well it takes some wall time, for example the Coherence report takes about 10 seconds on the test server, but if we can control the freque" [puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) (owner: 10CRusnov) [22:28:59] RECOVERY - puppet last run on scandium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:40:11] (03CR) 10CRusnov: coherence report: General improvements and rack checks (031 comment) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) (owner: 10CRusnov) [22:40:41] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [22:44:33] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [22:47:18] PROBLEM - LVS HTTP IPv4 on zotero.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [22:48:22] 10Operations, 10DC-Ops, 10netops: 
Inventorize network equipment in Netbox - https://phabricator.wikimedia.org/T221506 (10faidon) OK for switches, this did the trick: ` #!/usr/bin/perl use strict; use warnings; my $template = $ARGV[0]; my $device; while () { chomp; if (/FPC (\d)/) {... [22:48:27] RECOVERY - LVS HTTP IPv4 on zotero.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 138 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [22:51:15] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) is WARNING: Test Ensure Zotero is working responds with unexpected value at path [0]/itemType = webpage https://wikitech.wikimedia.org/wiki/Citoid [22:53:45] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [22:57:45] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [23:09:25] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [23:10:34] 10Operations, 10DC-Ops, 10netops: Inventorize network equipment in Netbox - https://phabricator.wikimedia.org/T221506 (10ayounsi) 05Open→03Resolved pfw and fasw devices added. I think everything that can be done here is done. [23:13:11] 10Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10ayounsi) [23:43:44] 10Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10faidon) [23:44:20] 10Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10faidon)