[00:00:04] twentyafterfour: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180531T0000).
[00:00:30] (CR) Dzahn: [C: 2] Planet: Fix path to libs [puppet] - https://gerrit.wikimedia.org/r/436428 (owner: Paladox)
[00:04:12] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active
[00:04:21] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational
[00:12:27] (PS7) Thcipriani: Scap: scap_source correct gid [puppet] - https://gerrit.wikimedia.org/r/361796
[00:20:02] (PS1) Alex Monk: deployment-prep: Update BounceHandlerInternalIPs [mediawiki-config] - https://gerrit.wikimedia.org/r/436430 (https://phabricator.wikimedia.org/T184244)
[00:21:45] (PS1) Alex Monk: deployment-prep: Update wikimail_smarthost [puppet] - https://gerrit.wikimedia.org/r/436431 (https://phabricator.wikimedia.org/T184244)
[00:22:42] Puppet, Beta-Cluster-Infrastructure, Patch-For-Review: Puppet broken on deployment-mx due to systemd on trusty - https://phabricator.wikimedia.org/T184244#4244761 (Krenair) a:Krenair
[00:27:24] (CR) Alex Monk: "need to check some things before I schedule this for SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/436430 (https://phabricator.wikimedia.org/T184244) (owner: Alex Monk)
[00:29:50] (Draft1) Paladox: Planet: One more fix for library [puppet] - https://gerrit.wikimedia.org/r/436432
[00:29:52] (PS2) Paladox: Planet: One more fix for library [puppet] - https://gerrit.wikimedia.org/r/436432
[00:36:58] (CR) Dzahn: [C: 2] Planet: One more fix for library [puppet] - https://gerrit.wikimedia.org/r/436432 (owner: Paladox)
[00:41:11] (PS1) Thcipriani: Beta: Add librenms dsh file [puppet] - https://gerrit.wikimedia.org/r/436433 (https://phabricator.wikimedia.org/T192561)
[00:45:33] Operations, Beta-Cluster-Infrastructure, Release-Engineering-Team, Patch-For-Review: Upgrade deployment-prep deployment servers to stretch - https://phabricator.wikimedia.org/T192561#4244792 (thcipriani) >>! In T192561#4183530, @thcipriani wrote: > Broken stuff > ========= > 3. iegreview has inv...
[01:05:57] !log $lang.planet.wikimedia.org is changing software from planet-venus to rawdog
[01:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:11:29] (PS1) Krinkle: webperf: Fix jumbo-eqiad reference to be compatible with Beta Cluster [puppet] - https://gerrit.wikimedia.org/r/436435 (https://phabricator.wikimedia.org/T195314)
[01:11:39] thcipriani: ^
[01:11:49] (CR) Krinkle: "Cherry picking to beta puppetmaster now to confirm." [puppet] - https://gerrit.wikimedia.org/r/436435 (https://phabricator.wikimedia.org/T195314) (owner: Krinkle)
[01:12:21] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties
[01:13:01] PROBLEM - Check systemd state on kafka-jumbo1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:14:01] RECOVERY - Check systemd state on kafka-jumbo1002 is OK: OK - running: The system is fully operational
[01:14:31] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1002 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties
[01:15:16] o this ^
[01:16:20] (PS1) Ottomata: Blacklist cirrusSearch.* job queue topics from main-eqiad -> jumbo-eqiad [puppet] - https://gerrit.wikimedia.org/r/436436 (https://phabricator.wikimedia.org/T196032)
[01:18:56] (CR) Ottomata: [C: 2] Blacklist cirrusSearch.* job queue topics from main-eqiad -> jumbo-eqiad [puppet] - https://gerrit.wikimedia.org/r/436436 (https://phabricator.wikimedia.org/T196032) (owner: Ottomata)
[01:18:57] (PS1) Ottomata: Regex fix for cirrusSearch job queue topic blacklist for Kafka MirrorMaker [puppet] - https://gerrit.wikimedia.org/r/436437 (https://phabricator.wikimedia.org/T196032)
[01:19:12] (CR) Ottomata: [V: 2 C: 2] Regex fix for cirrusSearch job queue topic blacklist for Kafka MirrorMaker [puppet] - https://gerrit.wikimedia.org/r/436437 (https://phabricator.wikimedia.org/T196032) (owner: Ottomata)
[01:30:27] !log pnorman@tin Started deploy [tilerator/deploy@78d1b82] (cleartables): Deploy updated stylesheet
[01:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:30:51] !log pnorman@tin Finished deploy [tilerator/deploy@78d1b82] (cleartables): Deploy updated stylesheet (duration: 00m 25s)
[01:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:35:52] Operations, Wikimedia-Planet, Patch-For-Review: planet.wikimedia.org: replace planet-venus software with rawdog - https://phabricator.wikimedia.org/T180498#4244845 (Dzahn) The backend of planet in misc-varnish has been switched to planet2001 which is on stretch and uses rawdog. Updates ran.. Some fi...
[01:38:26] !log pnorman@tin Started deploy [tilerator/deploy@78d1b82] (cleartables): Deploy updated stylesheet
[01:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:38:31] !log pnorman@tin Finished deploy [tilerator/deploy@78d1b82] (cleartables): Deploy updated stylesheet (duration: 00m 05s)
[01:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:39:01] !log pnorman@tin Started deploy [tilerator/deploy@fad9969] (cleartables): Deploy updated stylesheet
[01:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:39:24] Operations, Wikimedia-Planet, Patch-For-Review: planet.wikimedia.org: replace planet-venus software with rawdog - https://phabricator.wikimedia.org/T180498#4244847 (Paladox)
[01:39:25] !log pnorman@tin Finished deploy [tilerator/deploy@fad9969] (cleartables): Deploy updated stylesheet (duration: 00m 24s)
[01:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:39:48] Operations, Wikimedia-Planet, Patch-For-Review: planet.wikimedia.org: replace planet-venus software with rawdog - https://phabricator.wikimedia.org/T180498#3759794 (Paladox)
[01:40:16] Operations, Wikimedia-Planet, Patch-For-Review: upgrade planet instances to stretch - https://phabricator.wikimedia.org/T168490#4244850 (Paladox)
[01:40:22] Operations, Wikimedia-Planet, Patch-For-Review: planet.wikimedia.org: replace planet-venus software with rawdog - https://phabricator.wikimedia.org/T180498#3759794 (Paladox) Open>Resolved
[01:57:32] Operations, Puppet, Beta-Cluster-Infrastructure, Performance-Team: Include role/common in beta-cluster hieradata hierarchy - https://phabricator.wikimedia.org/T196034#4244858 (Krinkle)
[01:58:48] (PS1) Krinkle: deployment-prep: Remove override for scap::sources [puppet] - https://gerrit.wikimedia.org/r/436439 (https://phabricator.wikimedia.org/T195314)
[01:59:16] (CR) jerkins-bot: [V: -1] deployment-prep: Remove override for scap::sources [puppet] - https://gerrit.wikimedia.org/r/436439 (https://phabricator.wikimedia.org/T195314) (owner: Krinkle)
[02:00:00] (PS2) Krinkle: deployment-prep: Remove override for scap::sources [puppet] - https://gerrit.wikimedia.org/r/436439 (https://phabricator.wikimedia.org/T195314)
[02:00:02] (PS1) Krinkle: puppetmaster: Add role_hierarchy to labs.hiera [puppet] - https://gerrit.wikimedia.org/r/436440 (https://phabricator.wikimedia.org/T196034)
[02:00:30] (CR) jerkins-bot: [V: -1] deployment-prep: Remove override for scap::sources [puppet] - https://gerrit.wikimedia.org/r/436439 (https://phabricator.wikimedia.org/T195314) (owner: Krinkle)
[02:01:43] !log pnorman@tin Started deploy [tilerator/deploy@78448de] (cleartables): Deploy style with fewer fonts
[02:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:02:08] !log pnorman@tin Finished deploy [tilerator/deploy@78448de] (cleartables): Deploy style with fewer fonts (duration: 00m 25s)
[02:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:08:44] !log pnorman@tin Started deploy [tilerator/deploy@2a26f1e] (cleartables): Deploy style with fewer fonts
[02:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:09:08] !log pnorman@tin Finished deploy [tilerator/deploy@2a26f1e] (cleartables): Deploy style with fewer fonts (duration: 00m 24s)
[02:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:03:10] !log l10nupdate@tin scap sync-l10n completed (1.32.0-wmf.5) (duration: 16m 18s)
[03:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:27:12] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 777.93 seconds
[03:38:49] Operations, Wikimedia-Planet, Patch-For-Review: upgrade planet instances to stretch - https://phabricator.wikimedia.org/T168490#4244957 (Dzahn) proper revert to go back to planet-venus would be to first revert https://gerrit.wikimedia.org/r/#/c/436427/2/hieradata/role/common/cache/misc.yaml and the...
[04:05:09] !log l10nupdate@tin scap sync-l10n completed (1.32.0-wmf.6) (duration: 18m 15s)
[04:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:09:32] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 170.49 seconds
[04:18:19] !log rebooting wtp2001/wtp2015 for microcode updates
[04:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:19:57] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu May 31 04:19:57 UTC 2018 (duration 14m 48s)
[04:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:37:21] Operations, Puppet, Beta-Cluster-Infrastructure, Performance-Team, Patch-For-Review: Include role/common in beta-cluster hieradata hierarchy - https://phabricator.wikimedia.org/T196034#4244992 (MoritzMuehlenhoff) p:Triage>Normal
[04:40:02] (PS3) Muehlenhoff: Enable base::service_auto_restart for Burrow Prometheus exporters [puppet] - https://gerrit.wikimedia.org/r/434934 (https://phabricator.wikimedia.org/T135991)
[05:13:16] !log installing python-crypto security updates on trusty
[05:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:16:17] (PS1) Marostegui: db-codfw.php: Depool db2084:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/436446 (https://phabricator.wikimedia.org/T191316)
[05:17:42] (CR) jerkins-bot: [V: -1] db-codfw.php: Depool db2084:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/436446 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[05:19:32] (PS2) Marostegui: db-codfw.php: Depool db2084:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/436446 (https://phabricator.wikimedia.org/T191316)
[05:21:27] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2084:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/436446 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[05:22:57] (Merged) jenkins-bot: db-codfw.php: Depool db2084:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/436446 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[05:23:14] (CR) jenkins-bot: db-codfw.php: Depool db2084:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/436446 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[05:24:53] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2084:3315 for alter table (duration: 01m 27s)
[05:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:00] !log Deploy schema change on db2084:3315 - T191316 T192926 T89737 T195193
[05:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:06] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[05:25:07] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[05:25:07] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[05:25:07] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[05:40:41] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[05:40:56] !log restart pdfrender on scb1001
[05:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:55] Operations, ops-codfw, Patch-For-Review: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#4245015 (Marostegui)
[05:51:42] !log delete /tmp/scap_l10n_1501525840,scap_l10n_1501525840,l10nstuff,l10nstuff3 from tin to free some space in the root partition (1.9G left)
[05:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:53:50] other than this, I can see the following
[05:53:50] elukey@tin:/var/lib/l10nupdate$ sudo du -hs * | sort -h
[05:53:50] 4.9G mediawiki
[05:53:51] 6.5G caches
[05:54:49] and a couple of big home dirs
[05:55:16] but I have no idea about what can be cleaned or not..
[05:59:53] !log reimage druid1005 to Debian Stretch
[05:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:04] (PS2) Nehajha: Read command line arguments from a config file [software/tools-webservice] - https://gerrit.wikimedia.org/r/435691 (https://phabricator.wikimedia.org/T148872)
[06:09:47] !log reimage ganeti1003, ganeti1007 to stretch
[06:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:19] (PS1) Alexandros Kosiaris: Reimage ganeti100{3,4,7,8} to stretch [puppet] - https://gerrit.wikimedia.org/r/436447
[06:12:18] (CR) Alexandros Kosiaris: [C: 2] Reimage ganeti100{3,4,7,8} to stretch [puppet] - https://gerrit.wikimedia.org/r/436447 (owner: Alexandros Kosiaris)
[06:12:32] (PS2) Alexandros Kosiaris: Beta: Add librenms dsh file [puppet] - https://gerrit.wikimedia.org/r/436433 (https://phabricator.wikimedia.org/T192561) (owner: Thcipriani)
[06:12:41] (CR) Alexandros Kosiaris: [V: 2 C: 2] Beta: Add librenms dsh file [puppet] - https://gerrit.wikimedia.org/r/436433 (https://phabricator.wikimedia.org/T192561) (owner: Thcipriani)
[06:13:22] (PS1) Marostegui: Revert "db-codfw.php: Depool db2084:3315" [mediawiki-config] - https://gerrit.wikimedia.org/r/436448
[06:17:55] (PS1) Elukey: Override druid1005's zookeeper settings [puppet] - https://gerrit.wikimedia.org/r/436450 (https://phabricator.wikimedia.org/T192636)
[06:18:28] (CR) Elukey: [C: 2] Override druid1005's zookeeper settings [puppet] - https://gerrit.wikimedia.org/r/436450 (https://phabricator.wikimedia.org/T192636) (owner: Elukey)
[06:24:52] (CR) Marostegui: [C: 2] Revert "db-codfw.php: Depool db2084:3315" [mediawiki-config] - https://gerrit.wikimedia.org/r/436448 (owner: Marostegui)
[06:25:31] (CR) Alexandros Kosiaris: [C: 2] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/436341 (owner: Alex Monk)
[06:26:21] (Merged) jenkins-bot: Revert "db-codfw.php: Depool db2084:3315" [mediawiki-config] - https://gerrit.wikimedia.org/r/436448 (owner: Marostegui)
[06:27:36] (PS6) Alexandros Kosiaris: Allow use of PuppetDB in labs for ssh_known_hosts [puppet] - https://gerrit.wikimedia.org/r/333471 (https://phabricator.wikimedia.org/T72792) (owner: Alex Monk)
[06:28:00] (CR) Alexandros Kosiaris: [V: 2 C: 2] Allow use of PuppetDB in labs for ssh_known_hosts [puppet] - https://gerrit.wikimedia.org/r/333471 (https://phabricator.wikimedia.org/T72792) (owner: Alex Monk)
[06:28:12] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2084:3315 after alter table (duration: 01m 28s)
[06:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:39] (CR) jenkins-bot: Revert "db-codfw.php: Depool db2084:3315" [mediawiki-config] - https://gerrit.wikimedia.org/r/436448 (owner: Marostegui)
[06:30:03] !log Deploy schema change on db2092:3315 and db2094:3315 - T191316 T192926 T89737 T195193
[06:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:09] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[06:30:09] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[06:30:10] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[06:30:10] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[06:32:27] (PS1) Marostegui: db-codfw.php: Depool db2066 [mediawiki-config] - https://gerrit.wikimedia.org/r/436456 (https://phabricator.wikimedia.org/T191316)
[06:34:27] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2066 [mediawiki-config] - https://gerrit.wikimedia.org/r/436456 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[06:35:06] (PS1) Alexandros Kosiaris: d-i: Don't create swap on kubernetes nodes [puppet] - https://gerrit.wikimedia.org/r/436458
[06:35:52] (Merged) jenkins-bot: db-codfw.php: Depool db2066 [mediawiki-config] - https://gerrit.wikimedia.org/r/436456 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[06:36:45] (CR) Alexandros Kosiaris: [C: 2] d-i: Don't create swap on kubernetes nodes [puppet] - https://gerrit.wikimedia.org/r/436458 (owner: Alexandros Kosiaris)
[06:38:28] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2066 for alter table (duration: 01m 21s)
[06:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:33] !log Deploy schema change on db2066 - T191316 T192926 T89737 T195193
[06:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:39] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[06:38:39] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[06:38:40] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[06:38:40] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[06:41:36] (CR) jenkins-bot: db-codfw.php: Depool db2066 [mediawiki-config] - https://gerrit.wikimedia.org/r/436456 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[06:45:20] (PS1) Gehel: maps: install fonts-noto-unhinted [puppet] - https://gerrit.wikimedia.org/r/436463 (https://phabricator.wikimedia.org/T195474)
[06:48:16] akosiaris: all of cloud vps is failing puppet runs
[06:48:28] great!
[06:48:42] :(
[06:49:08] May 31 06:38:25 extdist-01 puppet-agent[12377]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, Failed to parse template ssh/known_hosts.erb:
[06:49:09] May 31 06:38:25 extdist-01 puppet-agent[12377]: Filepath: /etc/puppet/modules/puppetdbquery/lib/puppet/parser/functions/query_resources.rb
[06:49:09] May 31 06:38:25 extdist-01 puppet-agent[12377]: Line: 46
[06:49:09] May 31 06:38:25 extdist-01 puppet-agent[12377]: Detail: undefined method `server_urls' for #
[06:49:09] May 31 06:38:25 extdist-01 puppet-agent[12377]: at /etc/puppet/modules/ssh/manifests/client.pp:8:24 on node extdist-01.extdist.eqiad.wmflabs
[06:49:10] May 31 06:38:25 extdist-01 puppet-agent[12377]: Not using cache on failed catalog
[06:49:51] * akosiaris looking
[06:49:59] let's see if this can be solved without reverting
[06:51:44] how on earth is that part of the code even reached ?
[06:51:49] it shouldn't
[06:52:45] Operations, ops-ulsfo, Traffic, netops: troubleshoot cr3/cr4 link - https://phabricator.wikimedia.org/T196030#4245080 (ayounsi) Thanks, codfw uses MMF with those optics [[ https://apps.juniper.net/hct/model/?component=QFX-QSFP-40G-SR4 | QSFP+-40G-SR4 ]] ulsfo is SMF with [[ https://apps.juniper.n...
[07:00:21] ah found it
[07:00:30] settings::storeconfigs_backend defaults to puppetdb now
[07:07:01] (PS1) Marostegui: Revert "mariadb: Convert db2092 to sanitarium multi-instance" [puppet] - https://gerrit.wikimedia.org/r/436467
[07:07:20] (CR) jerkins-bot: [V: -1] Revert "mariadb: Convert db2092 to sanitarium multi-instance" [puppet] - https://gerrit.wikimedia.org/r/436467 (owner: Marostegui)
[07:07:57] (CR) Pnorman: [C: 1] maps: install fonts-noto-unhinted [puppet] - https://gerrit.wikimedia.org/r/436463 (https://phabricator.wikimedia.org/T195474) (owner: Gehel)
[07:08:12] (CR) Gehel: [C: 2] maps: install fonts-noto-unhinted [puppet] - https://gerrit.wikimedia.org/r/436463 (https://phabricator.wikimedia.org/T195474) (owner: Gehel)
[07:09:22] (PS1) Alexandros Kosiaris: realm.pp: check storeconfigs setting for use_puppetdb [puppet] - https://gerrit.wikimedia.org/r/436468
[07:09:29] let's see if this fixes it
[07:11:06] (PS1) Marostegui: mariadb: Move db2092 back to s1 [puppet] - https://gerrit.wikimedia.org/r/436469 (https://phabricator.wikimedia.org/T190704)
[07:12:00] (PS2) Marostegui: mariadb: Move db2092 back to s1 [puppet] - https://gerrit.wikimedia.org/r/436469 (https://phabricator.wikimedia.org/T190704)
[07:12:22] (Abandoned) Marostegui: Revert "mariadb: Convert db2092 to sanitarium multi-instance" [puppet] - https://gerrit.wikimedia.org/r/436467 (owner: Marostegui)
[07:13:09] (CR) Marostegui: [C: 2] mariadb: Move db2092 back to s1 [puppet] - https://gerrit.wikimedia.org/r/436469 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[07:14:03] PROBLEM - Host ganeti1007 is DOWN: PING CRITICAL - Packet loss = 100%
[07:14:13] PROBLEM - puppet last run on maps-test2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[fonts-noto-unhinted]
[07:14:14] RECOVERY - Host ganeti1007 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms
[07:15:14] PROBLEM - Host ganeti1003 is DOWN: PING CRITICAL - Packet loss = 100%
[07:15:52] Operations, ops-ulsfo, Traffic, netops: troubleshoot cr3/cr4 link - https://phabricator.wikimedia.org/T196030#4245085 (ayounsi) a:ayounsi>RobH
[07:17:53] RECOVERY - Host ganeti1003 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[07:19:32] Operations, DBA, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4245086 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db2...
[07:19:43] PROBLEM - puppet last run on maps1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[fonts-noto-unhinted]
[07:20:19] (PS1) Marostegui: Revert "db-codfw.php: Depool db2066" [mediawiki-config] - https://gerrit.wikimedia.org/r/436473
[07:20:43] RECOVERY - cassandra CQL 10.192.16.35:9042 on maps-test2004 is OK: TCP OK - 0.036 second response time on 10.192.16.35 port 9042
[07:21:13] (PS2) Alexandros Kosiaris: realm.pp: check storeconfigs setting for use_puppetdb [puppet] - https://gerrit.wikimedia.org/r/436468
[07:21:52] (CR) Marostegui: [C: 2] Revert "db-codfw.php: Depool db2066" [mediawiki-config] - https://gerrit.wikimedia.org/r/436473 (owner: Marostegui)
[07:22:57] (Merged) jenkins-bot: Revert "db-codfw.php: Depool db2066" [mediawiki-config] - https://gerrit.wikimedia.org/r/436473 (owner: Marostegui)
[07:23:13] (CR) jenkins-bot: Revert "db-codfw.php: Depool db2066" [mediawiki-config] - https://gerrit.wikimedia.org/r/436473 (owner: Marostegui)
[07:23:38] (PS1) Marostegui: Revert "sX.hosts: db2092 is now multiinstance" [software] - https://gerrit.wikimedia.org/r/436474
[07:23:44] PROBLEM - puppet last run on maps1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[fonts-noto-unhinted]
[07:23:44] (CR) jerkins-bot: [V: -1] Revert "sX.hosts: db2092 is now multiinstance" [software] - https://gerrit.wikimedia.org/r/436474 (owner: Marostegui)
[07:23:53] (Abandoned) Marostegui: Revert "sX.hosts: db2092 is now multiinstance" [software] - https://gerrit.wikimedia.org/r/436474 (owner: Marostegui)
[07:24:14] PROBLEM - puppet last run on maps-test2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[fonts-noto-unhinted]
[07:24:33] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2066 after alter table (duration: 01m 22s)
[07:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:53] (PS1) Marostegui: sX.hosts: Move db2092 back to s1 [software] - https://gerrit.wikimedia.org/r/436475
[07:26:53] PROBLEM - puppet last run on maps2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[fonts-noto-unhinted]
[07:26:56] (CR) Marostegui: [C: 2] sX.hosts: Move db2092 back to s1 [software] - https://gerrit.wikimedia.org/r/436475 (owner: Marostegui)
[07:27:42] (Merged) jenkins-bot: sX.hosts: Move db2092 back to s1 [software] - https://gerrit.wikimedia.org/r/436475 (owner: Marostegui)
[07:27:43] !log Deploy schema change on s5 codfw master (db2052) this will generate lag on codfw - T191316 T192926 T89737 T195193
[07:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:50] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[07:27:50] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[07:27:50] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[07:27:50] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[07:29:39] PROBLEM - puppet last run on maps2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[fonts-noto-unhinted]
[07:30:11] (CR) Giuseppe Lavagetto: [C: 1] realm.pp: check storeconfigs setting for use_puppetdb [puppet] - https://gerrit.wikimedia.org/r/436468 (owner: Alexandros Kosiaris)
[07:30:18] !log disable puppet for https://gerrit.wikimedia.org/r/#/c/436468/ merge cross fleet
[07:30:20] (PS1) Gehel: maps: fonts-noto-unhinted only available on stretch [puppet] - https://gerrit.wikimedia.org/r/436476 (https://phabricator.wikimedia.org/T195474)
[07:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:24] let's do this carefully
[07:30:30] gehel: FYI ^
[07:30:37] since I see you merging puppet changes
[07:31:02] (CR) Pnorman: [C: 1] maps: fonts-noto-unhinted only available on stretch [puppet] - https://gerrit.wikimedia.org/r/436476 (https://phabricator.wikimedia.org/T195474) (owner: Gehel)
[07:31:06] sorry for taking the GIL on this one, I promise to be quick
[07:31:27] (CR) Alexandros Kosiaris: [C: 2] realm.pp: check storeconfigs setting for use_puppetdb [puppet] - https://gerrit.wikimedia.org/r/436468 (owner: Alexandros Kosiaris)
[07:35:00] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1968 bytes in 0.095 second response time
[07:35:11] (CR) Giuseppe Lavagetto: [C: -2] "The role hierarchy cannot work with horizon at all as horizon doesn't declare the variable $_roles; also, we have plans to mostly replace " [puppet] - https://gerrit.wikimedia.org/r/436440 (https://phabricator.wikimedia.org/T196034) (owner: Krinkle)
[07:35:45] ok this was a noop on bast3002
[07:35:51] Operations, Wikimedia-Mailing-lists: New closed communication public policy mailing list needed - https://phabricator.wikimedia.org/T196041#4245094 (Dimi_z)
[07:36:03] running a few more tests and then I 'll release the GIL
[07:38:29] PROBLEM - MariaDB Slave Lag: s5 on db2052 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 618.28 seconds
[07:38:37] (PS2) Gehel: maps: fonts-noto-unhinted only available on stretch [puppet] - https://gerrit.wikimedia.org/r/436476 (https://phabricator.wikimedia.org/T195474)
[07:38:54] legoktm: fixed
[07:39:33] akosiaris: thanks :) I'll expect the storm of IRC messages about recovery to show up soon
[07:40:10] !log re-enable puppet across the fleet
[07:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:39] !log Stop Replication on db2066
[07:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:50] PROBLEM - puppet last run on maps-test2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[fonts-noto-unhinted]
[07:41:50] PROBLEM - puppet last run on maps2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[fonts-noto-unhinted]
[07:42:09] PROBLEM - puppet last run on maps1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[fonts-noto-unhinted]
[07:42:25] (CR) Gehel: [C: 2] maps: fonts-noto-unhinted only available on stretch [puppet] - https://gerrit.wikimedia.org/r/436476 (https://phabricator.wikimedia.org/T195474) (owner: Gehel)
[07:42:59] PROBLEM - puppet last run on maps1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[fonts-noto-unhinted]
[07:43:19] PROBLEM - puppet last run on maps2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[fonts-noto-unhinted]
[07:43:48] maps* should be recovering ...
[07:43:59] PROBLEM - puppet last run on maps1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[fonts-noto-unhinted]
[07:44:04] gehel: morning!
[07:44:07] gehel: elastic2018 has been down for the past 8h FYI
[07:44:29] PROBLEM - puppet last run on maps-test2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[fonts-noto-unhinted]
[07:44:30] PROBLEM - puppet last run on maps-test2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[fonts-noto-unhinted]
[07:44:40] PROBLEM - puppet last run on maps2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[fonts-noto-unhinted]
[07:44:50] PROBLEM - puppet last run on maps1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[fonts-noto-unhinted]
[07:44:52] ema: checking
[07:45:09] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1942 bytes in 0.168 second response time
[07:45:53] gehel: no useful iLO output, perhaps it needs a 'power reset'?
[07:46:34] ema: sounds like a good option. You're already on the console?
[07:46:59] RECOVERY - puppet last run on maps2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:47:16] gehel: yeah, ok to reset?
[07:47:34] ema: please!
[07:47:49] Operations, Maps-Sprint, Patch-For-Review: reimage maps-test2004 to stretch and cassandra 2.2 - https://phabricator.wikimedia.org/T195741#4245125 (Pnorman) Yes, it's confirmed to be that issue. The workaround was @gehel installing the `python-cassandra` package, then using `CQLSH_NO_BUNDLED=TRUE cqls...
[07:48:16] !log power-cycle elastic2018 [07:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:00] RECOVERY - puppet last run on maps1003 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [07:49:00] ema: thanks! [07:49:30] RECOVERY - puppet last run on maps-test2003 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [07:49:59] RECOVERY - puppet last run on maps1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:53:10] 10Operations, 10monitoring, 10Patch-For-Review: Evaluate Grafana's LDAP group options and deprecate grafana-admin if possible - https://phabricator.wikimedia.org/T170150#4245132 (10akosiaris) Deleting a user doesn't seem to cause issues in my tests, e.g. the Version History feature just stops listing the use... [07:54:06] akosiaris: confirmed, hosts are recovering now :) [07:54:12] :) [07:54:16] gehel: yw! It doesn't seem to be enough unfortunately [07:54:45] !log reimage kubernetes1004 without swap [07:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:50] RECOVERY - puppet last run on maps2003 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [07:55:20] PROBLEM - Host kubernetes1004 is DOWN: PING CRITICAL - Packet loss = 100% [07:55:27] 10Operations, 10Cassandra, 10Discovery, 10Maps: cassandra 2.2.6-wmf4 is not compatible with python 2.7.13 (debian stretch) - https://phabricator.wikimedia.org/T196044#4245147 (10Gehel) [07:55:48] ema: you're still on the shell? Does it say anything? [07:55:56] lol supposedly wmf-auto-reimage downtimed kubernetes1004 [07:56:09] RECOVERY - Host kubernetes1004 is UP: PING WARNING - Packet loss = 44%, RTA = 0.24 ms [07:56:23] gehel: I've logged out, no useful output after "Press 'ESC (' to return to the CLI Session." 
[07:57:00] RECOVERY - puppet last run on maps-test2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:57:11] ema: yeah, not very useful... thanks for trying! [07:57:19] PROBLEM - Mathoid LVS eqiad on mathoid.svc.eqiad.wmnet is CRITICAL: /{format}/ (mass-energy equivalence (complete)) timed out before a response was received: / (mass-energy equivalence (json)) timed out before a response was received [07:57:38] hmm that should not have happened [07:57:39] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - mathoid_10042: Servers kubernetes1004.eqiad.wmnet are marked down but pooled [07:58:07] (03PS1) 10Marostegui: install_server: Allow reimage db2092 [puppet] - 10https://gerrit.wikimedia.org/r/436478 (https://phabricator.wikimedia.org/T190704) [07:58:09] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([kubernetes1004.eqiad.wmnet]) [07:58:15] hmmm [07:58:18] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - mathoid_10042: Servers kubernetes1004.eqiad.wmnet are marked down but pooled [07:58:23] kubernetes1004 is expected [07:58:35] (03CR) 10星耀晨曦: [C: 04-1] "Just add a right to two user groups." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436213 (https://phabricator.wikimedia.org/T195247) (owner: 10Zoranzoki21) [07:58:38] RECOVERY - puppet last run on maps2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:58:44] but the LVS service alerting should not have happened [07:58:49] (03CR) 10Marostegui: [C: 032] install_server: Allow reimage db2092 [puppet] - 10https://gerrit.wikimedia.org/r/436478 (https://phabricator.wikimedia.org/T190704) (owner: 10Marostegui) [08:01:19] pods have indeed been scheduled on the other hosts [08:01:23] <_joe_> akosiaris: that it's pooled if dowm? 
not really [08:01:38] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([kubernetes1004.eqiad.wmnet]) [08:01:55] _joe_: PROBLEM - Mathoid LVS eqiad on mathoid.svc.eqiad.wmnet is CRITICAL: /{format}/ (mass-energy equivalence (complete)) timed out before a response was received: / (mass-energy equivalence (json)) timed out before a response was received [08:02:02] that's what I am ^ talking about [08:02:13] pybal complaining is fully expected [08:02:18] RECOVERY - puppet last run on maps1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:02:24] <_joe_> akosiaris: that could be due to the traffic being router to 1004 [08:02:51] the other LVS check did not complain (much) btw [08:03:04] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4245167 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db2... [08:03:08] RECOVERY - puppet last run on maps1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [08:03:12] it recovered pretty instantly without even paging [08:03:19] <_joe_> akosiaris: oh I know what happened [08:03:25] please do tell [08:03:26] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4245168 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2092.codfw.wmnet'] ``` Of which those **FAILED**: ```... 
[08:03:27] <_joe_> you still have the scbs in the pool [08:03:33] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4245169 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db2... [08:03:43] <_joe_> so pybal cannot depool any k8s node [08:03:59] wouldn't that hurt both checks though ? [08:04:02] <_joe_> https://dpaste.de/LP1g [08:04:11] what you say btw is true IIRC [08:04:12] checking [08:04:18] <_joe_> see my paste [08:04:29] <_joe_> the depool threshold is .5 for it [08:04:32] ok lemme set them as inactive and then remove them [08:04:39] <_joe_> ok [08:05:21] kubernetes btw reacted fine to the node not being there but it is still reporting the pods as existing, albeit "unknown" [08:05:26] status unknown I mean [08:05:29] which is logical [08:05:38] I guess I 'll need to remove the node from the config [08:06:12] yup that fixed that part of the problem [08:07:04] !log akosiaris@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,service=mathoid,cluster=scb,name=scb.* [08:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:17] let's see what this fixes [08:07:19] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1975 bytes in 0.078 second response time [08:07:19] <_joe_> and this fixerd pybal [08:07:26] <_joe_> # curl localhost:9090/pools/mathoid_10042 [08:07:26] <_joe_> kubernetes1001.eqiad.wmnet: enabled/up/pooled [08:07:26] <_joe_> kubernetes1003.eqiad.wmnet: enabled/up/pooled [08:07:26] <_joe_> kubernetes1002.eqiad.wmnet: enabled/up/pooled [08:07:26] <_joe_> kubernetes1004.eqiad.wmnet: enabled/down/not pooled [08:07:28] RECOVERY - Mathoid LVS eqiad on mathoid.svc.eqiad.wmnet is OK: All endpoints are healthy [08:07:28] 
<_joe_> root@lvs1016:~# [08:07:38] <_joe_> instantly, as expected [08:07:40] ah and also the LVS service [08:07:47] ok this I need to figure out why [08:07:59] RECOVERY - MariaDB Slave Lag: s5 on db2052 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [08:08:04] !log akosiaris@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=codfw,service=mathoid,cluster=scb,name=scb.* [08:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:09] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal [08:09:02] !log power reset elastic2018 [08:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:30] ema: I asked nicely, and it seems to obey... [08:09:56] gehel: it's your aura, not your demeanor [08:10:02] :P [08:10:12] :) [08:10:40] * akosiaris delivering job security lessons since .... I don't remember when [08:10:41] strong aura! it works when half a globe away [08:11:02] wat?!? now it want to boot on PXE [08:11:05] <_joe_> akosiaris: as I explained [08:11:19] it's transmitted via the keyboard to the copper wires to the fibers to the racks to the boxes [08:11:21] <_joe_> the alerts came whenever a request was routed to kubernetes1004 [08:11:29] RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal [08:11:41] <_joe_> given how wrr ipvs works, I fear it was preferred most of the time because of being down [08:12:18] _joe_: I think you are partly correct [08:12:19] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1961 bytes in 0.072 second response time [08:12:22] gehel: can I help? 
(heard pxe and reboot) :D [08:12:28] (03CR) 10Vgutierrez: [C: 031] "LGTM then :)" [debs/pybal] - 10https://gerrit.wikimedia.org/r/436298 (owner: 10Mark Bergsma) [08:12:47] as in, this explains why the LVS paging service did not complain much but the non paging one did [08:13:09] cause the non paging one does multiple requests to the LVS endpoint [08:13:22] and even if one fails, that's good enough for the entire check to fail [08:13:33] <_joe_> yes [08:13:38] whereas the paging one does a very simple check for /_info [08:13:50] and it would succeed 3/4 times I think [08:14:04] <_joe_> not sure about that [08:14:08] volans: good ear! Actually, it says something about no boot device found, then pxe, then back... feel free to have a look on elastic2018 and see if you find a clue as to what's going on... [08:14:08] I need to pull the data from icinga about that number [08:14:24] <_joe_> wrr tends to maintain the number of active connections balanced across nodes [08:14:49] RECOVERY - puppet last run on maps-test2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [08:15:02] <_joe_> if a node is down, connections are non-active quite fast, if I'm not mistaken [08:15:14] <_joe_> so we might send much more traffic than 1/4 to the down node [08:15:26] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T194907#4245180 (10Volans) [08:15:30] 10Operations, 10ops-eqiad: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T196014#4245182 (10Volans) [08:16:17] https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=mathoid.svc.eqiad.wmnet&service=LVS+HTTP+IPv4 is inconclusive of course [08:16:31] but it does mildly support what you say [08:17:44] 2 checks failed in the 07:57 to 08:07 [08:17:51] so 2/10 [08:18:05] hmm maybe it does not support it after all [08:18:45] it does look like it's 2.5 rounded down [08:19:08] which is exactly the number I expected to see 
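[editor's note] The arithmetic in the exchange above (a single-request check succeeding "3/4 times", and 2 failures out of 10 runs read as "2.5 rounded down") can be sketched as below. This is a rough model, not anything run in the channel: it assumes each request independently lands on the down node with a fixed probability (1/4 for one dead backend out of four under uniform routing), which is exactly the assumption _joe_ questions for wrr. The function names and the figure of 3 requests per non-paging check are hypothetical, chosen only for illustration.

```python
def check_failure_probability(down_fraction: float, requests_per_check: int) -> float:
    """Probability that a check fails when ANY one of its requests hits a down backend.

    Assumes (hypothetically) that each request independently reaches the down
    backend with probability `down_fraction`.
    """
    if not 0.0 <= down_fraction <= 1.0:
        raise ValueError("down_fraction must be between 0 and 1")
    if requests_per_check < 1:
        raise ValueError("requests_per_check must be at least 1")
    # The check passes only if every request avoids the down backend.
    return 1.0 - (1.0 - down_fraction) ** requests_per_check


def expected_failures(down_fraction: float, requests_per_check: int, runs: int) -> float:
    """Expected number of failed check runs out of `runs` independent runs."""
    return runs * check_failure_probability(down_fraction, requests_per_check)


if __name__ == "__main__":
    # Paging-style check: one request per run, 1 of 4 backends down, 10 runs.
    print(expected_failures(0.25, 1, 10))
    # Non-paging-style check: several requests per run (3 is an illustrative guess).
    print(check_failure_probability(0.25, 3))
```

Under these assumptions a multi-request check fails far more often than a single-request one, which matches the observation that the non-paging check alerted while the paging one mostly stayed quiet. If wrr really does steer more than 1/4 of new connections to the down node, `down_fraction` would need to be raised accordingly.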
[08:25:04] (03PS1) 10Alexandros Kosiaris: d-i: Force no_swap for kubernetes partman and note it [puppet] - 10https://gerrit.wikimedia.org/r/436481 [08:27:12] <_joe_> heh, that's better [08:27:55] (03CR) 10Alexandros Kosiaris: [C: 032] d-i: Force no_swap for kubernetes partman and note it [puppet] - 10https://gerrit.wikimedia.org/r/436481 (owner: 10Alexandros Kosiaris) [08:28:32] (03CR) 10Jcrespo: [C: 04-1] "Don't do cross-dc requests without TLS and without connection pooling." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429841 (https://phabricator.wikimedia.org/T192339) (owner: 10Andrew Bogott) [08:31:38] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4245194 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db2... [08:32:09] 10Operations, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): elastic2018 not rebooting - https://phabricator.wikimedia.org/T196045#4245195 (10Gehel) [08:34:13] (03PS2) 10Alexandros Kosiaris: Delete sshknowngen [puppet] - 10https://gerrit.wikimedia.org/r/436341 (owner: 10Alex Monk) [08:34:18] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Delete sshknowngen [puppet] - 10https://gerrit.wikimedia.org/r/436341 (owner: 10Alex Monk) [08:35:59] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:36:58] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation=compareAndSwap https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:37:08] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation=compareAndSwap https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:39:18] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:39:18] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:39:52] !log ladsgroup@terbium:~$ mwscript deleteAutoPatrolLogs.php --wiki=commonswiki --sleep 2 --check-old [08:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:18] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:42:55] !log power off elastic2018 - T196045 [08:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:00] T196045: elastic2018 not rebooting - https://phabricator.wikimedia.org/T196045 [08:43:02] (03PS1) 10Alexandros Kosiaris: kubernetes: Remove deprecated --api-servers parameter [puppet] - 10https://gerrit.wikimedia.org/r/436483 [08:43:06] 10Operations, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): elastic2018 not rebooting - https://phabricator.wikimedia.org/T196045#4245224 (10Volans) Having a look around in the system utility (ESC+9) I found that: ``` System Health Summary > System BIOS > Health Status: Configuration Required... [08:44:17] 10Operations, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): elastic2018 not rebooting - https://phabricator.wikimedia.org/T196045#4245225 (10Gehel) @Papaul could you have a look at elastic2018 and see if you understand anything? The server is powered off, do anything you'd like with it... [08:49:22] (03CR) 10Giuseppe Lavagetto: ">" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/436240 (https://phabricator.wikimedia.org/T192771) (owner: 10Giuseppe Lavagetto) [08:50:35] 10Operations, 10Wikimedia-Mailing-lists: Mailman issues a "403 Forbidden" error when subscribing to a list - https://phabricator.wikimedia.org/T195750#4245248 (10Sylvain_WMFr) Thanks! 
I will investigate on why we are blocked, and we will use our smartphones to register in the meantime. [08:51:57] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4245249 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2092.codfw.wmnet'] ``` and were **ALL** successful. [08:53:31] PROBLEM - Disk space on kubernetes1004 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:56:31] RECOVERY - Disk space on kubernetes1004 is OK: DISK OK [08:59:41] 10Operations, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): elastic2018 not rebooting - https://phabricator.wikimedia.org/T196045#4245280 (10MoritzMuehlenhoff) p:05Triage>03Normal a:03Papaul [09:00:09] (03CR) 10Giuseppe Lavagetto: utils: add script to generate mcrouter-related certs (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/436240 (https://phabricator.wikimedia.org/T192771) (owner: 10Giuseppe Lavagetto) [09:01:53] (03PS8) 10Jcrespo: mariadb: Add extra_port on port + 20 for multiinstance hosts [puppet] - 10https://gerrit.wikimedia.org/r/435751 [09:01:55] (03PS1) 10Jcrespo: mariadb: Reimage db2083 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/436484 (https://phabricator.wikimedia.org/T196047) [09:03:39] (03CR) 10Jcrespo: [C: 032] mariadb: Reimage db2083 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/436484 (https://phabricator.wikimedia.org/T196047) (owner: 10Jcrespo) [09:03:44] (03PS2) 10Jcrespo: mariadb: Reimage db2083 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/436484 (https://phabricator.wikimedia.org/T196047) [09:04:26] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1082 for reimage to stretch" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436323 (owner: 10Jcrespo) [09:05:48] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1082 for reimage to stretch" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/436323 (owner: 10Jcrespo) [09:06:26] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy [09:07:07] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [09:07:12] (03PS1) 10Vgutierrez: update-ocsp: Actually use --time-offset-end argument [puppet] - 10https://gerrit.wikimedia.org/r/436485 (https://phabricator.wikimedia.org/T163541) [09:07:26] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={compareAndSwap,create} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:08:26] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:11:04] (03PS4) 10Giuseppe Lavagetto: puppetmaster::frontend: add cergen-managed CA for mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/436240 (https://phabricator.wikimedia.org/T192771) [09:11:31] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1082 fully (duration: 01m 21s) [09:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:35] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster::frontend: add cergen-managed CA for mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/436240 (https://phabricator.wikimedia.org/T192771) (owner: 10Giuseppe Lavagetto) [09:13:56] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1082 for reimage to stretch" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436323 (owner: 10Jcrespo) [09:19:30] (03CR) 10Elukey: "> Any new about this patch ?" 
[software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/430069 (owner: 10R4q3NWnUx2CEhVyr) [09:25:05] 10Operations, 10ops-ulsfo, 10Traffic, 10netops, 10Patch-For-Review: Rack/cable/configure ulsfo MX204 - https://phabricator.wikimedia.org/T189552#4245361 (10ayounsi) [09:26:37] (03PS5) 10Giuseppe Lavagetto: puppetmaster::frontend: add cergen-managed CA for mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/436240 (https://phabricator.wikimedia.org/T192771) [09:28:34] !log reimage druid1006 to Debian Stretch [09:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:26] (03PS1) 10Elukey: role::druid::public: set zookeeper version to 3.4.9-3 [puppet] - 10https://gerrit.wikimedia.org/r/436487 (https://phabricator.wikimedia.org/T192636) [09:32:14] (03CR) 10Elukey: [C: 032] role::druid::public: set zookeeper version to 3.4.9-3 [puppet] - 10https://gerrit.wikimedia.org/r/436487 (https://phabricator.wikimedia.org/T192636) (owner: 10Elukey) [09:34:33] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation=compareAndSwap https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:35:12] !log testing icinga alerting [09:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:44] PROBLEM - Host druid1006 is DOWN: PING CRITICAL - Packet loss = 100% [09:36:53] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation=compareAndSwap https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:37:34] RECOVERY - Host druid1006 is UP: PING OK - Packet loss = 0%, RTA = 1.88 ms [09:37:56] weird it is downtimed [09:38:21] elukey: not really... 
we're having some issue on icinga processing the command file [09:38:31] ah nice :) [09:38:32] I'll probably have to restart it [09:39:23] * volans about to restart icinga, has issues processing the command file [HEADSUP] [09:39:43] PROBLEM - Check whether ferm is active by checking the default input chain on druid1006 is CRITICAL: Return code of 255 is out of bounds [09:39:43] PROBLEM - configured eth on druid1006 is CRITICAL: Return code of 255 is out of bounds [09:39:43] PROBLEM - Check systemd state on druid1006 is CRITICAL: Return code of 255 is out of bounds [09:39:44] PROBLEM - Druid coordinator on druid1006 is CRITICAL: Return code of 255 is out of bounds [09:39:44] PROBLEM - Druid overlord on druid1006 is CRITICAL: Return code of 255 is out of bounds [09:39:53] PROBLEM - dhclient process on druid1006 is CRITICAL: Return code of 255 is out of bounds [09:39:53] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:39:54] PROBLEM - Zookeeper Server on druid1006 is CRITICAL: Return code of 255 is out of bounds [09:39:57] (03PS1) 10Marostegui: db-codfw.php: Depool db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436488 [09:40:03] PROBLEM - Druid broker on druid1006 is CRITICAL: Return code of 255 is out of bounds [09:40:06] !log restarting Icinga, issues processing the command file [09:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:51] PROBLEM - Druid middlemanager on druid1006 is CRITICAL: Return code of 255 is out of bounds [09:40:51] PROBLEM - Druid historical on druid1006 is CRITICAL: Return code of 255 is out of bounds [09:41:12] PROBLEM - puppet last run on druid1006 is CRITICAL: Return code of 255 is out of bounds [09:42:09] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436488 (owner: 10Marostegui) [09:42:34] elukey: I've put a standard
downtime on druid1006 to test if it worked [09:42:42] adjust it at your will ;) [09:42:57] thanks! The host should be back soon from d-i [09:43:04] so whatever you put is fine [09:43:47] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436488 (owner: 10Marostegui) [09:44:03] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:45:26] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2062 for cloning db2092 (duration: 01m 22s) [09:45:28] !log Stop MySQL on db2062 to clone db2092 [09:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:05] (03PS1) 10Ema: cache hosts: enable microcode updates [puppet] - 10https://gerrit.wikimedia.org/r/436490 (https://phabricator.wikimedia.org/T127825) [09:47:23] (03CR) 10Ema: [C: 031] Enable microcode updates for all mediawiki servers [puppet] - 10https://gerrit.wikimedia.org/r/436271 (https://phabricator.wikimedia.org/T127825) (owner: 10Muehlenhoff) [09:47:37] (03CR) 10Muehlenhoff: [C: 031] cache hosts: enable microcode updates [puppet] - 10https://gerrit.wikimedia.org/r/436490 (https://phabricator.wikimedia.org/T127825) (owner: 10Ema) [09:48:09] (03PS1) 10Giuseppe Lavagetto: Add hieradata for mcrouter's ca_secret [labs/private] - 10https://gerrit.wikimedia.org/r/436492 [09:48:54] (03CR) 10jenkins-bot: db-codfw.php: Depool db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436488 (owner: 10Marostegui) [09:51:20] (03PS1) 10Jcrespo: mariadb: Repool db2083 after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436493 (https://phabricator.wikimedia.org/T196047) [09:53:31] RECOVERY - Druid middlemanager on druid1006 is OK: PROCS OK: 1 process with command name java, args io.druid.cli.Main server middleManager [09:53:31] 
RECOVERY - configured eth on druid1006 is OK: OK - interfaces up [09:53:32] RECOVERY - Druid overlord on druid1006 is OK: PROCS OK: 1 process with command name java, args io.druid.cli.Main server overlord [09:53:41] RECOVERY - Druid coordinator on druid1006 is OK: PROCS OK: 1 process with command name java, args io.druid.cli.Main server coordinator [09:54:00] (03CR) 10Marostegui: "Was it cloned already?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436493 (https://phabricator.wikimedia.org/T196047) (owner: 10Jcrespo) [09:54:01] RECOVERY - dhclient process on druid1006 is OK: PROCS OK: 0 processes with command name dhclient [09:59:03] RECOVERY - Check systemd state on druid1006 is OK: OK - running: The system is fully operational [09:59:22] RECOVERY - Zookeeper Server on druid1006 is OK: PROCS OK: 1 process with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg [09:59:32] RECOVERY - Check whether ferm is active by checking the default input chain on druid1006 is OK: OK ferm input default policy is set [09:59:43] RECOVERY - Druid historical on druid1006 is OK: PROCS OK: 1 process with command name java, args io.druid.cli.Main server historical [10:00:22] RECOVERY - Druid broker on druid1006 is OK: PROCS OK: 1 process with command name java, args io.druid.cli.Main server broker [10:01:13] RECOVERY - puppet last run on druid1006 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [10:02:12] ACKNOWLEDGEMENT - Host elastic2018 is DOWN: PING CRITICAL - Packet loss = 100% Volans Hardware issues, host shut down: https://phabricator.wikimedia.org/T196045 [10:02:21] (03PS1) 10Addshore: Wikibase.php shirt around the loading of WikibaseLexeme [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436495 [10:02:23] (03PS1) 10Addshore: Load WikibaseLexeme on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436496 (https://phabricator.wikimedia.org/T195615) [10:02:25] (03PS1) 
10Addshore: Load WikibaseLexeme on all of group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436497 (https://phabricator.wikimedia.org/T195615) [10:02:27] (03PS1) 10Addshore: Load WikibaseLexeme on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436498 (https://phabricator.wikimedia.org/T195615) [10:02:30] (03PS1) 10Addshore: Load WikibaseLexeme on all wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436499 (https://phabricator.wikimedia.org/T195615) [10:03:50] (03PS2) 10Addshore: Wikibase.php shift around the loading of WikibaseLexeme [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436495 [10:04:12] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb={PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:05:40] (03PS1) 10Addshore: Remove not needed Lexeme stuff from -labs files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436500 [10:11:24] (03PS2) 10Muehlenhoff: Remove obsolete mediawiki Upstart jobs [puppet] - 10https://gerrit.wikimedia.org/r/436242 [10:11:26] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Add hieradata for mcrouter's ca_secret [labs/private] - 10https://gerrit.wikimedia.org/r/436492 (owner: 10Giuseppe Lavagetto) [10:12:11] (03CR) 10Muehlenhoff: [C: 032] Remove obsolete mediawiki Upstart jobs [puppet] - 10https://gerrit.wikimedia.org/r/436242 (owner: 10Muehlenhoff) [10:12:13] (03PS6) 10Giuseppe Lavagetto: puppetmaster::frontend: add cergen-managed CA for mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/436240 (https://phabricator.wikimedia.org/T192771) [10:15:33] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:15:46] (03PS2) 10Muehlenhoff: Enable microcode updates for all mediawiki servers [puppet] - 10https://gerrit.wikimedia.org/r/436271 (https://phabricator.wikimedia.org/T127825) [10:16:25] (03CR) 10Muehlenhoff: [C: 032] Enable 
microcode updates for all mediawiki servers [puppet] - 10https://gerrit.wikimedia.org/r/436271 (https://phabricator.wikimedia.org/T127825) (owner: 10Muehlenhoff) [10:18:18] <_joe_> moritzm: I'm not sure mediawiki upstart jobs are unused everywhere [10:26:18] (03CR) 10Volans: puppetmaster::frontend: add cergen-managed CA for mcrouter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/436240 (https://phabricator.wikimedia.org/T192771) (owner: 10Giuseppe Lavagetto) [10:26:22] I checked watroles and couldn [10:26:30] _joe_: I checked watroles and couldn't find further [10:26:42] PROBLEM - MariaDB Slave Lag: s3 on db2074 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.48 seconds [10:26:53] PROBLEM - MariaDB Slave Lag: s3 on db2043 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.60 seconds [10:26:54] <_joe_> ok, anyways, better to ask for forgiveness than for permission in this case [10:27:03] PROBLEM - MariaDB Slave Lag: s3 on db2050 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.15 seconds [10:27:12] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:27:32] PROBLEM - MariaDB Slave Lag: s3 on db2036 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.36 seconds [10:27:33] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.37 seconds [10:28:15] (03CR) 10Giuseppe Lavagetto: puppetmaster::frontend: add cergen-managed CA for mcrouter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/436240 (https://phabricator.wikimedia.org/T192771) (owner: 10Giuseppe Lavagetto) [10:30:13] (03PS7) 10Giuseppe Lavagetto: puppetmaster::frontend: add cergen-managed CA for mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/436240 (https://phabricator.wikimedia.org/T192771) [10:33:21] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:33:21] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:35:32] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:36:51] PROBLEM - Check systemd state on thorium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:37:37] this is due to pivot, my bad --^ [10:38:12] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:39:42] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[pivot] [10:41:04] (03PS1) 10Elukey: role::analytics_cluster::webserver: deprecate pivot [puppet] - 10https://gerrit.wikimedia.org/r/436501 (https://phabricator.wikimedia.org/T194427) [10:41:40] (03PS2) 10Elukey: role::analytics_cluster::webserver: deprecate pivot [puppet] - 10https://gerrit.wikimedia.org/r/436501 (https://phabricator.wikimedia.org/T194427) [10:42:44] (03PS3) 10Elukey: role::analytics_cluster::webserver: deprecate pivot [puppet] - 10https://gerrit.wikimedia.org/r/436501 (https://phabricator.wikimedia.org/T194427) [10:43:15] (03CR) 10Elukey: [C: 032] role::analytics_cluster::webserver: deprecate pivot [puppet] - 10https://gerrit.wikimedia.org/r/436501 (https://phabricator.wikimedia.org/T194427) (owner: 10Elukey) [10:45:26] !log removed Pivot from thorium (pivot.wikimedia.org now simply redirects to Turnilo) [10:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:02] RECOVERY - Check systemd state on thorium is OK: OK - running: The system is fully operational [10:49:51] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 4 
minutes ago with 0 failures [10:51:45] (03PS1) 10Elukey: Remove pivot from puppet [puppet] - 10https://gerrit.wikimedia.org/r/436503 [10:51:48] so happy [10:54:35] (03PS2) 10Elukey: Remove pivot from puppet [puppet] - 10https://gerrit.wikimedia.org/r/436503 [10:54:57] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster::frontend: add cergen-managed CA for mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/436240 (https://phabricator.wikimedia.org/T192771) (owner: 10Giuseppe Lavagetto) [10:55:05] (03PS8) 10Giuseppe Lavagetto: puppetmaster::frontend: add cergen-managed CA for mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/436240 (https://phabricator.wikimedia.org/T192771) [10:56:03] (03PS3) 10Elukey: Remove pivot from puppet [puppet] - 10https://gerrit.wikimedia.org/r/436503 [11:07:31] (03PS2) 10Volans: debmonitor: specify MySQL connection options [puppet] - 10https://gerrit.wikimedia.org/r/436286 (https://phabricator.wikimedia.org/T191299) [11:07:33] (03PS1) 10Volans: debmonitor: add cache misc controller [puppet] - 10https://gerrit.wikimedia.org/r/436504 (https://phabricator.wikimedia.org/T191299) [11:07:45] (03PS1) 10Volans: Add public debmonitor.wikimedia.org endpoint [dns] - 10https://gerrit.wikimedia.org/r/436505 (https://phabricator.wikimedia.org/T191299) [11:08:05] (03CR) 10Volans: [C: 04-2] "Pending I9bfeaa6c1360e64c6abff6620013fe829388fb0e" [dns] - 10https://gerrit.wikimedia.org/r/436505 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [11:08:40] (03CR) 10Arturo Borrero Gonzalez: [C: 031] profile::docker::flannel: Use systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/433101 (https://phabricator.wikimedia.org/T190893) (owner: 10Zhuyifei1999) [11:09:29] (03PS1) 10Marostegui: db-eqiad.php: Depool sanitarium masters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436506 (https://phabricator.wikimedia.org/T190704) [11:11:13] (03PS4) 10Arturo Borrero Gonzalez: profile::docker::flannel: Use systemd::service [puppet] - 
10https://gerrit.wikimedia.org/r/433101 (https://phabricator.wikimedia.org/T190893) (owner: 10Zhuyifei1999) [11:11:25] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool sanitarium masters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436506 (https://phabricator.wikimedia.org/T190704) (owner: 10Marostegui) [11:12:30] (03Merged) 10jenkins-bot: db-eqiad.php: Depool sanitarium masters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436506 (https://phabricator.wikimedia.org/T190704) (owner: 10Marostegui) [11:12:37] (03CR) 10Arturo Borrero Gonzalez: [C: 032] profile::docker::flannel: Use systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/433101 (https://phabricator.wikimedia.org/T190893) (owner: 10Zhuyifei1999) [11:12:46] (03CR) 10jenkins-bot: db-eqiad.php: Depool sanitarium masters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436506 (https://phabricator.wikimedia.org/T190704) (owner: 10Marostegui) [11:13:54] (03PS1) 10Giuseppe Lavagetto: mcrouter_generate_certs: fix a couple typos [puppet] - 10https://gerrit.wikimedia.org/r/436507 [11:14:22] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool sanitarium masters - T190704 (duration: 01m 22s) [11:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:26] T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704 [11:15:50] (03CR) 10Giuseppe Lavagetto: [C: 032] mcrouter_generate_certs: fix a couple typos [puppet] - 10https://gerrit.wikimedia.org/r/436507 (owner: 10Giuseppe Lavagetto) [11:27:59] (03CR) 10Arturo Borrero Gonzalez: [C: 032] Add `webservice-python-bootstrap` command [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/435662 (https://phabricator.wikimedia.org/T174769) (owner: 10Legoktm) [11:29:09] (03PS1) 10Volans: debmonitor: add basic HTTP Icinga check [puppet] - 10https://gerrit.wikimedia.org/r/436509 
(https://phabricator.wikimedia.org/T191299) [11:41:19] (03PS1) 10ArielGlenn: allow writeuptopageid to write multiple output files [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/436511 (https://phabricator.wikimedia.org/T196063) [11:46:05] (03PS1) 10Hashar: ci: add VisualEditor and Wikibase to git cache [puppet] - 10https://gerrit.wikimedia.org/r/436512 [11:46:32] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation=compareAndSwap https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:50:50] 10Operations, 10WMF-Blog-Social-Team, 10Wikimedia-Mailing-lists: Request mailman list for upcoming affiliate campaign - https://phabricator.wikimedia.org/T196003#4245711 (10Aklapper) (The Football/Soccer one, I assume? :) ) [11:54:23] PROBLEM - puppet last run on kubernetes1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 23 minutes ago with 1 failures. Failed resources (up to 3 shown): Physical_volume[/dev/md2] [12:00:22] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:03:40] 10Operations, 10WMF-Blog-Social-Team, 10Wikimedia-Mailing-lists: Request mailman list for upcoming affiliate campaign - https://phabricator.wikimedia.org/T196003#4245741 (10MelodyKramer) Yup. ⚽ ⚽ [12:05:22] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:06:23] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:11:12] (03PS2) 10Ema: cache hosts: enable microcode updates [puppet] - 10https://gerrit.wikimedia.org/r/436490 (https://phabricator.wikimedia.org/T127825) [12:11:53] (03CR) 10Ema: [C: 032] cache hosts: enable microcode updates [puppet] - 10https://gerrit.wikimedia.org/r/436490 (https://phabricator.wikimedia.org/T127825) (owner: 10Ema) [12:17:36] (03PS9) 10Jcrespo: mariadb: Add extra_port on port + 20 for multiinstance hosts [puppet] - 10https://gerrit.wikimedia.org/r/435751 [12:17:38] (03PS1) 10Jcrespo: mariadb: Reimage db2079 as stretch, do not reimage db2083 [puppet] - 10https://gerrit.wikimedia.org/r/436516 (https://phabricator.wikimedia.org/T196047) [12:21:24] 10Operations, 10Analytics, 10Traffic: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066#4245756 (10elukey) p:05Triage>03Normal [12:21:29] 10Operations, 10Analytics, 10EventBus, 10MediaWiki-JobQueue, 10Services (watching): Clean up cpjobqueue metrics - https://phabricator.wikimedia.org/T196067#4245769 (10Pchelolo) p:05Triage>03Normal [12:21:54] (03CR) 10Mark Bergsma: [C: 031] Add tests that emulate client or server sessions initial connection (032 comments) [debs/pybal] - 10https://gerrit.wikimedia.org/r/434162 (owner: 10Mark Bergsma) [12:26:43] (03PS1) 10ArielGlenn: use default resources for xml/sql dumps on snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/436517 [12:29:54] 10Operations, 10Analytics, 10EventBus, 10MediaWiki-JobQueue, and 2 others: Clean up cpjobqueue metrics - https://phabricator.wikimedia.org/T196067#4245785 (10mobrovac) [12:29:57] (03CR) 10ArielGlenn: [C: 032] use default resources for xml/sql dumps on snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/436517 (owner: 10ArielGlenn) [12:30:21] !log Stop replication on all sanitarium masters - T190704 [12:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:26] 
T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704 [12:37:06] 10Operations, 10Analytics, 10Traffic: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066#4245793 (10elukey) As reference, `prometheus::node_gdnsd` might be an example about how to proceed. [12:40:19] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db2083 after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436493 (https://phabricator.wikimedia.org/T196047) (owner: 10Jcrespo) [12:40:33] (03PS2) 10Jcrespo: mariadb: Reimage db2079 as stretch, do not reimage db2083 [puppet] - 10https://gerrit.wikimedia.org/r/436516 (https://phabricator.wikimedia.org/T196047) [12:40:51] (03CR) 10Jcrespo: [C: 032] mariadb: Reimage db2079 as stretch, do not reimage db2083 [puppet] - 10https://gerrit.wikimedia.org/r/436516 (https://phabricator.wikimedia.org/T196047) (owner: 10Jcrespo) [12:41:59] (03Merged) 10jenkins-bot: mariadb: Repool db2083 after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436493 (https://phabricator.wikimedia.org/T196047) (owner: 10Jcrespo) [12:42:14] (03CR) 10jenkins-bot: mariadb: Repool db2083 after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436493 (https://phabricator.wikimedia.org/T196047) (owner: 10Jcrespo) [12:42:32] PROBLEM - MariaDB Slave Lag: s2 on db1102 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 631.55 seconds [12:42:41] ^ that is me [12:42:42] PROBLEM - MariaDB Slave Lag: s8 on db1116 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 639.82 seconds [12:42:43] PROBLEM - MariaDB Slave Lag: s5 on db1116 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 645.66 seconds [12:42:45] I think I silenced it [12:42:52] PROBLEM - MariaDB Slave Lag: s4 on db1102 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 650.01 seconds [12:42:52] PROBLEM - MariaDB Slave Lag: s5 on db1124 is CRITICAL: 
CRITICAL slave_sql_lag Replication lag: 650.50 seconds [12:42:58] probably missed it [12:42:59] doing it now [12:43:03] PROBLEM - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 662.87 seconds [12:43:03] PROBLEM - MariaDB Slave Lag: s6 on db1102 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 664.95 seconds [12:43:07] maybe you did it while the icinga issues was ongoing? [12:43:13] PROBLEM - MariaDB Slave Lag: s3 on db1116 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 676.03 seconds [12:43:22] it happened to several of us [12:43:22] PROBLEM - MariaDB Slave Lag: s8 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 681.10 seconds [12:43:23] PROBLEM - MariaDB Slave Lag: s1 on db1116 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 681.79 seconds [12:43:23] PROBLEM - MariaDB Slave Lag: s1 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 682.95 seconds [12:43:23] PROBLEM - MariaDB Slave Lag: s7 on db1102 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 685.53 seconds [12:43:33] I have silenced them now, too late anyways :( [12:54:31] !log reimage kubernetes200{3,4}.codfw.wmnet [12:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:53] 10Operations, 10DBA, 10Math: Remove table `math` from the database - https://phabricator.wikimedia.org/T196055#4245815 (10Reedy) [12:58:38] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/436520 (https://phabricator.wikimedia.org/T135991) [12:59:05] (03PS1) 10Alexandros Kosiaris: kubernetes: Alter the docker physical volume [puppet] - 10https://gerrit.wikimedia.org/r/436521 [12:59:28] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] kubernetes: Alter the docker physical volume [puppet] - 10https://gerrit.wikimedia.org/r/436521 (owner: 10Alexandros Kosiaris) [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, 
thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180531T1300). Please do the needful. [13:00:04] Pablo_WMDE: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:18] o/ [13:00:39] around [13:01:23] Pablo_WMDE: I should have CR+2 ed the patch earlier [13:01:51] RECOVERY - MariaDB Slave Lag: s5 on db1116 is OK: OK slave_sql_lag Replication lag: 0.21 seconds [13:01:52] RECOVERY - MariaDB Slave Lag: s5 on db1124 is OK: OK slave_sql_lag Replication lag: 0.18 seconds [13:03:01] *** patiently waiting [13:04:11] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,run_podsandbox,start_container,stop_podsandbox} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:04:18] expected ^ [13:04:57] with kubernetes2003 rebooted others had to pick up the load and start new pods. I expect this to subside really quickly [13:05:11] RECOVERY - MariaDB Slave Lag: s6 on db1102 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [13:05:23] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1965 bytes in 0.083 second response time [13:05:52] RECOVERY - MariaDB Slave Lag: s4 on db1102 is OK: OK slave_sql_lag Replication lag: 26.33 seconds [13:06:11] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:06:31] RECOVERY - MariaDB Slave Lag: s8 on db1124 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [13:06:31] RECOVERY - MariaDB Slave Lag: s1 on db1116 is OK: OK slave_sql_lag Replication lag: 0.10 seconds [13:06:32] RECOVERY - MariaDB Slave Lag: s7 on db1102 is OK: OK slave_sql_lag Replication lag: 0.26 seconds [13:06:41] RECOVERY - MariaDB Slave Lag: s8 on db1116 is OK: OK slave_sql_lag Replication lag: 0.32 seconds [13:07:32] RECOVERY - MariaDB Slave Lag: s1 on db1124 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [13:08:51] RECOVERY - MariaDB Slave Lag: s2 on db1102 is OK: OK slave_sql_lag Replication lag: 0.39 seconds [13:10:01] PROBLEM - puppet last run on kubestage1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Physical_volume[/dev/md1] [13:10:22] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1971 bytes in 0.078 second response time [13:12:32] PROBLEM - kubelet operational latencies on kubestage1002 is CRITICAL: instance=kubestage1002.eqiad.wmnet operation_type={create_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:12:57] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4245853 (10Marostegui) labsdb1009 has been moved under the new sanitarium hosts. We will leave it replicating till Monday befor... [13:13:12] paladox: ok change got merged. Deploying [13:13:14] errr [13:13:17] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4245854 (10Marostegui) [13:13:21] wrong user :) [13:13:21] Pablo_WMDE: change got merged. 
Deploying to mwdebug1001 [13:13:32] RECOVERY - kubelet operational latencies on kubestage1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:14:38] hashar Checking with extension - very slow. loading... [13:15:03] Pablo_WMDE: it is on mwdebug1001 now [13:16:02] hashar: Looking good. [13:16:25] hashar: https://www.wikidata.org/wiki/Lexeme:L17?uselang=ko (in case you want to take a look). Encoding fixed [13:16:33] hashar: Thanks [13:17:06] 10Operations, 10DBA, 10Math: Remove table `math` from the database - https://phabricator.wikimedia.org/T196055#4245860 (10Marostegui) [13:17:21] (03PS1) 10Urbanecm: Assign movefile to autoreviewrs and patrollers on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436524 (https://phabricator.wikimedia.org/T195247) [13:18:37] 10Operations, 10Wikimedia-Mailing-lists: Mailman issues a "403 Forbidden" error when subscribing to a list - https://phabricator.wikimedia.org/T195750#4245864 (10herron) 05Open>03Resolved a:03herron Sounds good! [13:19:00] Pablo_WMDE: ok deploying :] [13:22:32] RECOVERY - MariaDB Slave Lag: s3 on db1124 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [13:22:32] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1967 bytes in 0.088 second response time [13:22:52] RECOVERY - MariaDB Slave Lag: s3 on db1116 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [13:23:03] !log hashar@tin Synchronized php-1.32.0-wmf.6/vendor: WikibaseLexeme: Encoding problems in labels - T195359 (duration: 03m 12s) [13:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:07] T195359: Encoding problems in labels (Korean, French, etc.) - https://phabricator.wikimedia.org/T195359 [13:23:12] Pablo_WMDE: done [13:24:09] hashar: Many thanks. 
Works great w/o debug, too [13:24:23] awesome :] [13:25:59] (03CR) 10Urbanecm: [C: 04-1] "Please abandon this patch Zoranzoki21. It was superseded by 436524." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436213 (https://phabricator.wikimedia.org/T195247) (owner: 10Zoranzoki21) [13:26:46] (03PS1) 10BBlack: cache_upload: list for purges for images on cache_misc wikis, too [puppet] - 10https://gerrit.wikimedia.org/r/436526 [13:30:41] (03PS1) 10Ottomata: Blacklist mediawiki_revision_score from Hive refinement [puppet] - 10https://gerrit.wikimedia.org/r/436529 (https://phabricator.wikimedia.org/T195979) [13:30:50] (03PS1) 10Alexandros Kosiaris: kubernetes: Docker physical volume remapping [puppet] - 10https://gerrit.wikimedia.org/r/436530 [13:31:04] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::mcrouter_wancache: update the ssl paths [puppet] - 10https://gerrit.wikimedia.org/r/436531 (https://phabricator.wikimedia.org/T192771) [13:31:06] (03PS1) 10Giuseppe Lavagetto: mcrouter: fix hiera labels, install on mwdebug servers [puppet] - 10https://gerrit.wikimedia.org/r/436532 (https://phabricator.wikimedia.org/T192771) [13:31:27] PROBLEM - puppet last run on kubernetes2003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 11 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Service[docker],Physical_volume[/dev/md2] [13:31:33] (03CR) 10Ottomata: [C: 032] Blacklist mediawiki_revision_score from Hive refinement [puppet] - 10https://gerrit.wikimedia.org/r/436529 (https://phabricator.wikimedia.org/T195979) (owner: 10Ottomata) [13:31:40] (03CR) 10jerkins-bot: [V: 04-1] profile::mediawiki::mcrouter_wancache: update the ssl paths [puppet] - 10https://gerrit.wikimedia.org/r/436531 (https://phabricator.wikimedia.org/T192771) (owner: 10Giuseppe Lavagetto) [13:31:54] (03CR) 10jerkins-bot: [V: 04-1] mcrouter: fix hiera labels, install on mwdebug servers [puppet] - 10https://gerrit.wikimedia.org/r/436532 (https://phabricator.wikimedia.org/T192771) (owner: 10Giuseppe Lavagetto) [13:32:46] (03CR) 10Alexandros Kosiaris: [C: 031] Add public debmonitor.wikimedia.org endpoint [dns] - 10https://gerrit.wikimedia.org/r/436505 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [13:33:06] (03CR) 10Alexandros Kosiaris: [C: 031] debmonitor: add basic HTTP Icinga check [puppet] - 10https://gerrit.wikimedia.org/r/436509 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [13:34:17] (03PS2) 10BBlack: cache_upload: listen for cache_misc image purges [puppet] - 10https://gerrit.wikimedia.org/r/436526 [13:34:33] hashar: is swat done? 
[13:34:50] I believe so as per the deployments page, there was only one patch [13:34:53] (03CR) 10Alexandros Kosiaris: [C: 031] debmonitor: add cache misc controller [puppet] - 10https://gerrit.wikimedia.org/r/436504 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [13:35:14] (03CR) 10Andrew Bogott: "> Don't do cross-dc requests without TLS and without connection pooling" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429841 (https://phabricator.wikimedia.org/T192339) (owner: 10Andrew Bogott) [13:35:16] (03CR) 10Alexandros Kosiaris: [C: 031] debmonitor: specify MySQL connection options [puppet] - 10https://gerrit.wikimedia.org/r/436286 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [13:36:13] (03PS2) 10Volans: Add debmonitor endpoints [dns] - 10https://gerrit.wikimedia.org/r/436505 (https://phabricator.wikimedia.org/T191299) [13:36:24] (03CR) 10Andrew Bogott: [C: 04-1] Read command line arguments from a config file (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/435691 (https://phabricator.wikimedia.org/T148872) (owner: 10Nehajha) [13:37:44] (03CR) 10Muehlenhoff: debmonitor: add basic HTTP Icinga check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/436509 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [13:37:49] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool sanitarium masters" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436533 [13:37:52] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool sanitarium masters" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436533 [13:38:30] (03CR) 10BBlack: [C: 032] cache_upload: listen for cache_misc image purges [puppet] - 10https://gerrit.wikimedia.org/r/436526 (owner: 10BBlack) [13:38:53] (03CR) 10Volans: debmonitor: add basic HTTP Icinga check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/436509 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [13:39:02] (03CR) 10Marostegui: [C: 032] Revert 
"db-eqiad.php: Depool sanitarium masters" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436533 (owner: 10Marostegui) [13:39:44] (03PS2) 10Volans: debmonitor: add basic HTTP Icinga check [puppet] - 10https://gerrit.wikimedia.org/r/436509 (https://phabricator.wikimedia.org/T191299) [13:40:10] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool sanitarium masters" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436533 (owner: 10Marostegui) [13:40:31] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool sanitarium masters" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436533 (owner: 10Marostegui) [13:41:44] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool sanitarium masters - T190704 (duration: 01m 21s) [13:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:48] T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704 [13:42:09] (03CR) 10Ottomata: [C: 031] Remove pivot from puppet [puppet] - 10https://gerrit.wikimedia.org/r/436503 (owner: 10Elukey) [13:42:36] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1947 bytes in 0.079 second response time [13:42:54] 10Operations, 10Analytics, 10DC-Ops, 10procurement: Analytics hosts missing in Inventory/Refresh - https://phabricator.wikimedia.org/T196072#4245932 (10elukey) [13:44:09] (03PS2) 10Giuseppe Lavagetto: profile::mediawiki::mcrouter_wancache: update the ssl paths [puppet] - 10https://gerrit.wikimedia.org/r/436531 (https://phabricator.wikimedia.org/T192771) [13:44:11] (03PS2) 10Giuseppe Lavagetto: mcrouter: fix hiera labels, install on mwdebug servers [puppet] - 10https://gerrit.wikimedia.org/r/436532 (https://phabricator.wikimedia.org/T192771) [13:44:33] 10Operations, 10Math: Clean up artifacts from LaTeX based math rendering - https://phabricator.wikimedia.org/T195847#4245934 (10Physikerwelt) 
[13:46:30] (03PS1) 10Marostegui: db2092: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/436536 [13:47:32] (03CR) 10Jcrespo: [C: 04-1] "I thought about this, and at first I thought it could work- now I think it will not. Also moving it to m5 was a bad idea." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429841 (https://phabricator.wikimedia.org/T192339) (owner: 10Andrew Bogott) [13:48:00] (03PS2) 10Marostegui: db2092: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/436536 [13:48:42] (03CR) 10Marostegui: [C: 032] db2092: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/436536 (owner: 10Marostegui) [13:53:07] (03CR) 10Ottomata: [C: 031] webperf: Fix jumbo-eqiad reference to be compatible with Beta Cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/436435 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [13:54:42] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for varnishreqstats [puppet] - 10https://gerrit.wikimedia.org/r/436538 (https://phabricator.wikimedia.org/T135991) [13:58:48] jouncebot: next [13:58:48] In 0 hour(s) and 1 minute(s): WikibaseLexeme (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180531T1400) [13:58:57] I'm going to be 10 mins late for that :) [14:00:05] addshore: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for WikibaseLexeme . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180531T1400). 
[14:00:29] (03CR) 10Alexandros Kosiaris: [C: 031] Reducing max length for varchar columns [software/debmonitor] - 10https://gerrit.wikimedia.org/r/436243 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [14:01:36] (03CR) 10Alexandros Kosiaris: [C: 031] MySQL config fine-tuning [software/debmonitor] - 10https://gerrit.wikimedia.org/r/436244 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [14:02:08] (03PS1) 10Ottomata: Enable SSL port for Kafka main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/436540 (https://phabricator.wikimedia.org/T193778) [14:02:10] (03PS1) 10Ottomata: Enable inter broker SSL and auth acls for Kafka main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/436541 (https://phabricator.wikimedia.org/T193778) [14:02:50] (03CR) 10Muehlenhoff: [C: 031] debmonitor: add basic HTTP Icinga check [puppet] - 10https://gerrit.wikimedia.org/r/436509 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [14:04:10] (03PS2) 10Ottomata: Enable SSL port for Kafka main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/436540 (https://phabricator.wikimedia.org/T193778) [14:04:14] (03CR) 10Ottomata: [V: 032 C: 032] Enable SSL port for Kafka main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/436540 (https://phabricator.wikimedia.org/T193778) (owner: 10Ottomata) [14:08:00] (03PS3) 10Volans: MySQL config fine-tuning [software/debmonitor] - 10https://gerrit.wikimedia.org/r/436244 (https://phabricator.wikimedia.org/T191299) [14:08:05] (03CR) 10Volans: [C: 032] Reducing max length for varchar columns [software/debmonitor] - 10https://gerrit.wikimedia.org/r/436243 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [14:08:24] !log beginning restarts of Kafka main-eqiad to enable SSL port - T193778 [14:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:28] T193778: SSL and inter broker encryption for Kafka main - https://phabricator.wikimedia.org/T193778 [14:09:09] (03CR) 10Jcrespo: 
""the maximum length for a column with an index is 191 chars (767 / 4)" That is not true, long_prefix_index makes it 3000 chars or so." [software/debmonitor] - 10https://gerrit.wikimedia.org/r/436243 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [14:09:17] (03Merged) 10jenkins-bot: Reducing max length for varchar columns [software/debmonitor] - 10https://gerrit.wikimedia.org/r/436243 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [14:10:04] jynus: my understanding is that 3000 or so is for the whole index but still has 767 per column [14:10:31] and I reduced it because it failed to create the tables for that [14:11:15] The index key prefix length limit is 3072 bytes for InnoDB [14:11:32] so why it failed saying > 767? :D [14:11:40] if it fails is becase the tables are badly configured [14:11:47] using an outdate engine [14:12:20] (or row format/file format) [14:12:33] ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_520_ci [14:12:52] the full error was [14:12:52] django.db.utils.OperationalError: (1071, 'Specified key was too long; max key length is 767 bytes') [14:13:02] yes [14:13:17] if you create them with the wrong format [14:13:55] https://phabricator.wikimedia.org/T193222#4163325 [14:14:22] is not the default I guess [14:14:38] we default to binary, as mediawiki does [14:15:29] I'll check how I can tell django this [14:15:49] (03PS2) 10Addshore: Remove not needed Lexeme stuff from -labs files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436500 [14:16:23] (03CR) 10Addshore: [C: 032] Remove not needed Lexeme stuff from -labs files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436500 (owner: 10Addshore) [14:16:35] it is the default on 8.0 / 10.X but we are not yet there [14:16:44] plus it breaks other things [14:17:30] (03Merged) 10jenkins-bot: Remove not needed Lexeme stuff from -labs files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436500 (owner: 10Addshore) [14:17:44] I guess I can 
enable it, maybe? [14:17:50] (03PS2) 10Ottomata: Enable inter broker SSL and auth acls for Kafka main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/436541 (https://phabricator.wikimedia.org/T193778) [14:17:56] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020 - https://phabricator.wikimedia.org/T194964#4246032 (10chasemp) p:05Triage>03Normal [14:17:58] but I cannot set a default_row_format [14:18:01] (03CR) 10Ottomata: [V: 032 C: 032] Enable inter broker SSL and auth acls for Kafka main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/436541 (https://phabricator.wikimedia.org/T193778) (owner: 10Ottomata) [14:18:08] until a newer mysql / mariadb version [14:18:08] I should be able to tell django to set on connection [14:18:30] but I would not compromise the application [14:18:36] but given it is useful only when creating tables, I would like to check if I can set it only for the manage.py that takes care of DB migration [14:18:39] the problem is the transition to good config [14:18:49] and not for every connection [14:18:49] sure [14:18:50] is messy [14:19:06] the new versions have everything as it should [14:19:17] but not-the-latest makes things confusing [14:19:24] (03CR) 10jenkins-bot: Remove not needed Lexeme stuff from -labs files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436500 (owner: 10Addshore) [14:20:21] !log addshore@tin Synchronized wmf-config/Wikibase-labs.php: BETA ONLY [[gerrit:436500|gerrit]] (duration: 01m 21s) [14:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:25] fwiw it was not a compromise, the 255 were arbitrary limits anyway, and I don't think we'll ever get any row hitting that limit, but I'm checking how to have a proper config, that's a good thing anyway [14:20:30] thanks for the feedback [14:21:54] !log addshore@tin Synchronized wmf-config/InitialiseSettings-labs.php: BETA ONLY [[gerrit:436500|gerrit]]
(duration: 01m 21s) [14:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:26] (03PS3) 10Addshore: Wikibase.php shift around the loading of WikibaseLexeme [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436495 [14:22:30] (03CR) 10Addshore: [C: 032] Wikibase.php shift around the loading of WikibaseLexeme [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436495 (owner: 10Addshore) [14:24:24] (03Merged) 10jenkins-bot: Wikibase.php shift around the loading of WikibaseLexeme [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436495 (owner: 10Addshore) [14:24:45] (03CR) 10Muehlenhoff: [C: 031] debmonitor: add cache misc controller [puppet] - 10https://gerrit.wikimedia.org/r/436504 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [14:25:00] (03CR) 10jenkins-bot: Wikibase.php shift around the loading of WikibaseLexeme [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436495 (owner: 10Addshore) [14:27:42] !log addshore@tin Synchronized wmf-config/Wikibase.php: [[gerrit:436495|Wikibase.php shift around the loading of WikibaseLexeme]] (duration: 01m 22s) [14:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:08] 10Operations, 10ops-codfw, 10Cloud-VPS: move/setup/install labtestnet2002(WMF6469) - https://phabricator.wikimedia.org/T196000#4246064 (10Papaul) a:05RobH>03Papaul [14:30:41] (03CR) 10Muehlenhoff: [C: 031] Add debmonitor endpoints [dns] - 10https://gerrit.wikimedia.org/r/436505 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [14:31:16] jynus: sorry, correct me if I'm wrong. Both innodb_large_prefix and innodb_file_format are global right? So I cannot use them only for debmonitor tables. 
[14:31:54] but you can ask for it to be changed, at least on installer run [14:32:12] it is changed on most other hosts [14:32:54] PROBLEM - MariaDB Slave Lag: s3 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 695.95 seconds [14:32:57] but m2 is shared [14:33:26] s3 codfw is lagging behind because of the maintenance script running [14:33:32] so that's expected [14:33:39] db2094 is a sanitarium host [14:33:41] (03PS1) 10Ema: prometheus: export intel-microcode information via node_exporter [puppet] - 10https://gerrit.wikimedia.org/r/436553 (https://phabricator.wikimedia.org/T127825) [14:33:43] we change it, you run the script, and we revert [14:33:53] then we analyze if we can change it permanently [14:33:56] I don't see the issue [14:34:03] they only affect creates [14:34:08] ahhh, ok, I didn't know it would be ok to have it mixed [14:34:31] in theory it should be like I suggested [14:34:32] (03CR) 10jerkins-bot: [V: 04-1] prometheus: export intel-microcode information via node_exporter [puppet] - 10https://gerrit.wikimedia.org/r/436553 (https://phabricator.wikimedia.org/T127825) (owner: 10Ema) [14:34:47] but honestly, it takes time to check it doesn't break things [14:34:55] we can do a quick patch now [14:35:11] the barracuda one I think it is applied everywhere, marostegui? [14:35:23] yeah, it should be [14:35:29] he did a bunch of servers in the past [14:35:42] the long is safe, at least for some time [14:35:53] *large [14:35:54] on db1051 I got Antelope [14:36:25] file format =/= row format [14:36:27] !log addshore@tin Synchronized php-1.32.0-wmf.6/extensions/WikibaseLexeme/src/WikibaseLexemeHooks.php: T195615 Dont run repo only hooks on clients (duration: 01m 24s) [14:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:32] T195615: handle use of statements linking to Lexemes (and Forms?)
more gracefully on client - https://phabricator.wikimedia.org/T195615 [14:36:52] right [14:37:02] but yes [14:37:08] it seems it is not barracuda [14:37:09] volans: I think I did most of the servers, but given it had to be changed in 10000 files, maybe I missed some :-) [14:37:17] maybe it got changed on config [14:37:22] but not on live config [14:37:48] modules/role/templates/mariadb/mysqld_config/misc.my.cnf.erb this one has barracuda there [14:38:09] i don't see it on the config [14:38:11] !log addshore@tin Synchronized php-1.32.0-wmf.5/extensions/WikibaseLexeme/src/WikibaseLexemeHooks.php: T195615 Dont run repo only hooks on clients (duration: 01m 20s) [14:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:20] cannot find it on /etc/my.cnf on the host [14:38:21] volans: the point is, if you need help, you ask for it, and we help you [14:38:27] is db1051 using that template? [14:38:54] I'm ok with 190 for now jynus, so no strict need for me. But if you prefer to have newer DB created with the newer formats to avoid to migrate later [14:38:59] I'm all for it [14:39:08] (03PS2) 10Addshore: Load WikibaseLexeme on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436496 (https://phabricator.wikimedia.org/T195615) [14:39:11] (03CR) 10Addshore: [C: 032] Load WikibaseLexeme on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436496 (https://phabricator.wikimedia.org/T195615) (owner: 10Addshore) [14:39:25] 190 is a bit silly [14:39:36] it was tought for 3 byte chars [14:39:51] ? 
[14:40:03] (03PS3) 10Muehlenhoff: Enable Intel microcode installation for labvirt [puppet] - 10https://gerrit.wikimedia.org/r/433359 (https://phabricator.wikimedia.org/T194258) [14:40:08] 190*4 = 760 [14:40:17] (03Merged) 10jenkins-bot: Load WikibaseLexeme on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436496 (https://phabricator.wikimedia.org/T195615) (owner: 10Addshore) [14:40:32] (03CR) 10jenkins-bot: Load WikibaseLexeme on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436496 (https://phabricator.wikimedia.org/T195615) (owner: 10Addshore) [14:40:42] or you mean that 767 / 3 is 255 :D [14:40:46] marostegui: I see the issue [14:40:49] <% if scope['::role::mariadb::misc::shard'] == 'm5' -%> [14:40:58] 10Operations, 10ops-codfw, 10Cloud-VPS: move/setup/install labtestnet2002(WMF6469) - https://phabricator.wikimedia.org/T196000#4246108 (10Papaul) a:05Papaul>03RobH [14:41:00] it was supposed to be removed, that was just a test [14:41:15] I can do it now [14:41:27] Ah ok :) [14:41:33] PROBLEM - MariaDB Slave Lag: s8 on db2079 is CRITICAL: CRITICAL slave_sql_lag could not connect [14:41:36] it is actually a bug [14:41:46] jynus: ^ that is you, right? 
[14:41:47] because it will break replication with the replicas [14:41:49] yes [14:41:54] PROBLEM - MariaDB Slave IO: s8 on db2079 is CRITICAL: CRITICAL slave_io_state could not connect [14:42:00] i will silence it [14:42:02] your conversation distracted me from reimage it [14:42:04] don't [14:42:11] ok [14:42:11] I will take care of it soon [14:42:13] :) [14:42:18] PROBLEM - mysqld processes on db2079 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [14:42:23] PROBLEM - MariaDB Slave SQL: s8 on db2079 is CRITICAL: CRITICAL slave_sql_state could not connect [14:42:54] got the page anyways [14:43:21] my fault, I was distracting them ;) [14:43:25] heh [14:43:38] volans: just ask for help when you see a problem [14:43:44] don't workaround it ugly [14:43:50] in this case it was a bug [14:43:56] it was supposed to be there [14:44:04] but test code was not removed [14:44:23] and removing it it is just easier [14:44:46] kk :) [14:45:27] glad we found it, I didn't ask just because I had this bad info that also with innodb_large_prefix the single column was limited... my bad [14:45:36] ? [14:45:40] limited? [14:45:48] limited to 767 anyway [14:45:51] no [14:46:05] I know now, but I had this bad info in my head, dunno from where I got it [14:46:10] reading bad docs ;) [14:46:14] it just to take effect has restrictions [14:46:23] it needs compact or compressed [14:46:28] which requires barracuda [14:46:32] and file per table [14:46:42] etc. 
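The byte arithmetic behind the 190 / 255 / 767 figures in this exchange can be sketched as below. This is a minimal illustration, not something that was run in the channel. The 767-byte limit comes from the conversation; the 3072-byte value is an assumption taken from the documented InnoDB large-prefix limit for DYNAMIC/COMPRESSED row formats with the default 16 KB page size, and the helper name `max_indexed_chars` is hypothetical.

```python
# Index key prefix limits, in bytes.
SMALL_PREFIX_BYTES = 767    # limit mentioned in the log (no large prefix)
LARGE_PREFIX_BYTES = 3072   # assumption: large-prefix limit with 16 KB pages


def max_indexed_chars(limit_bytes: int, bytes_per_char: int) -> int:
    """Longest fully indexable VARCHAR, in characters, for a byte limit."""
    return limit_bytes // bytes_per_char


if __name__ == "__main__":
    for label, width in (("3 bytes/char", 3), ("4 bytes/char", 4)):
        print(
            label,
            max_indexed_chars(SMALL_PREFIX_BYTES, width),
            max_indexed_chars(LARGE_PREFIX_BYTES, width),
        )
```

With 4-byte characters, 190 fits under the small limit (190 * 4 = 760) and 191 is the actual ceiling; 255 characters only fit with 3-byte characters or under the larger prefix limit.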
[14:46:45] and it is messy [14:46:48] yeah [14:46:57] and then defaults change [14:47:05] and people get even more confused [14:47:37] for example, large_prefix_index no longer exists on some versions as it is the default [14:47:43] (03PS1) 10Addshore: Revert "Load WikibaseLexeme on testwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436556 [14:47:54] (03PS1) 10Jcrespo: mariadb: Enable barracuda and large prefix index on m1 and m2, too [puppet] - 10https://gerrit.wikimedia.org/r/436557 (https://phabricator.wikimedia.org/T150949) [14:48:00] if you think we can change it for m2 (either temporarily or globablly) without too much work I can revert the last commit and wait for it [14:48:21] in fact I think we should do it permanently [14:48:25] or replication will break [14:48:34] after all, it will only affect new creates [14:48:49] (03CR) 10Addshore: [C: 032] Revert "Load WikibaseLexeme on testwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436556 (owner: 10Addshore) [14:48:54] it was applied to m2 replicas already [14:49:24] see also the comment: # innodb_default_row_format is not available until 5.7.9 and maybe until 10.2 [14:49:55] yeah [14:49:55] (03Merged) 10jenkins-bot: Revert "Load WikibaseLexeme on testwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436556 (owner: 10Addshore) [14:49:55] so check which row format is being used by your commands after the change [14:50:00] or it won't work [14:50:33] https://gerrit.wikimedia.org/r/#/c/436557/ [14:50:59] marostegui: on the servers you changed, did you also alter the live config? 
[14:51:13] we may want to review all live values [14:51:25] jynus: I think I did on lots of them, but I can review it [14:51:39] doesn't need to happen now [14:51:39] (03CR) 10jenkins-bot: Revert "Load WikibaseLexeme on testwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436556 (owner: 10Addshore) [14:51:44] just a todo [14:51:45] I can help, feeling guilty to have added more work ;) [14:51:52] you didn't add work [14:51:56] you spotted a bug we had [14:52:11] (03PS2) 10Jcrespo: mariadb: Enable barracuda and large prefix index on m1 and m2, too [puppet] - 10https://gerrit.wikimedia.org/r/436557 (https://phabricator.wikimedia.org/T150949) [14:52:20] (03PS1) 10Addshore: Load WikibaseLexeme on testwiki (again) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436560 [14:52:36] (03PS2) 10Addshore: Load WikibaseLexeme on all of group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436497 (https://phabricator.wikimedia.org/T195615) [14:52:41] (03PS2) 10Addshore: Load WikibaseLexeme on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436498 (https://phabricator.wikimedia.org/T195615) [14:52:47] (03PS2) 10Addshore: Load WikibaseLexeme on all wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436499 (https://phabricator.wikimedia.org/T195615) [14:52:53] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: NOOP [[gerrit:436496|patch]] and [[gerrit:436556|revert]] Load WikibaseLexeme on testwiki (sanity) (duration: 01m 22s) [14:52:57] marostegui: we may not have changed large index on many hosts [14:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:13] so we can file that as a ticket and check later [14:53:23] sure [14:53:25] only strict mode and barracuda [14:53:36] (which was the original scope) [14:53:39] to be fair [14:53:49] Yeah [14:54:30] volans: please take a look at https://gerrit.wikimedia.org/r/436557 [14:54:34] (03CR) 10Muehlenhoff: [C: 032] Enable Intel
microcode installation for labvirt [puppet] - 10https://gerrit.wikimedia.org/r/433359 (https://phabricator.wikimedia.org/T194258) (owner: 10Muehlenhoff) [14:54:42] and we can change all m2 [14:54:47] leave the others for later [14:54:51] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/436557 (https://phabricator.wikimedia.org/T150949) (owner: 10Jcrespo) [14:55:14] puppet wise is perfect, you know much better the consequences mysql wise ;) [14:55:16] (03PS3) 10Jcrespo: mariadb: Enable barracuda and large prefix index on m1 and m2, too [puppet] - 10https://gerrit.wikimedia.org/r/436557 (https://phabricator.wikimedia.org/T150949) [14:55:35] well, I will not apply it to core [14:55:44] because it may have some incompatiblity issues [14:55:52] but for existing tables or new deployments [14:55:56] it should be ok [14:56:13] we also have pending a migration to compressed [14:56:22] so we will need it eventually [14:56:30] (03CR) 10Jcrespo: [C: 032] mariadb: Enable barracuda and large prefix index on m1 and m2, too [puppet] - 10https://gerrit.wikimedia.org/r/436557 (https://phabricator.wikimedia.org/T150949) (owner: 10Jcrespo) [14:56:44] PROBLEM - SSH on labtestservices2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:57:46] indeed [14:58:53] RECOVERY - SSH on labtestservices2001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.10 (protocol 2.0) [14:59:53] volans: so which one of the two of us will create an outage first by confusing debmonitor and dbmonitor ? 
[15:00:13] (03PS2) 10Ema: prometheus: export intel-microcode information via node_exporter [puppet] - 10https://gerrit.wikimedia.org/r/436553 (https://phabricator.wikimedia.org/T127825) [15:00:21] 10Operations, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): elastic2018 not rebooting - https://phabricator.wikimedia.org/T196045#4246130 (10Papaul) a:05Papaul>03Gehel @Gehel for some reason, the server lost some settings like in the BIOS Serial Console & EMS EMS Console was COM1 , BOOT op... [15:00:43] (03CR) 10jerkins-bot: [V: 04-1] prometheus: export intel-microcode information via node_exporter [puppet] - 10https://gerrit.wikimedia.org/r/436553 (https://phabricator.wikimedia.org/T127825) (owner: 10Ema) [15:00:56] jynus: eheheh I tought about it when adding it to various hiera/dns places :D [15:01:12] RECOVERY - Host elastic2018 is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms [15:01:56] volans: create table test (c varchar(200), KEY(c)) row_format=dynamic; [15:02:02] Query OK, 0 rows affected (0.00 sec) [15:02:26] we cannot setup the dynamic value from global config [15:02:31] (03PS3) 10Ema: prometheus: export intel-microcode information via node_exporter [puppet] - 10https://gerrit.wikimedia.org/r/436553 (https://phabricator.wikimedia.org/T127825) [15:02:35] only on 5.7/10.2 [15:02:38] so I must tell django to use it, right? [15:02:39] so that is up to you [15:03:12] and I think it is the default on 8.0/10.3, don't quote me exactly [15:03:46] (03CR) 10jerkins-bot: [V: 04-1] prometheus: export intel-microcode information via node_exporter [puppet] - 10https://gerrit.wikimedia.org/r/436553 (https://phabricator.wikimedia.org/T127825) (owner: 10Ema) [15:03:51] replication didn't broke, so you would be ok [15:04:14] great! 
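A sketch of the check suggested earlier in the conversation ("check which row format is being used by your commands after the change"), assuming the rows come from `information_schema.TABLES`. The schema name `debmonitor`, the constant names and the helper `tables_without_large_prefix` are hypothetical, and no database driver call is shown; only the filtering step is illustrated.

```python
# Query against information_schema; run it with any MySQL/MariaDB client.
# The schema name is a bound parameter (assumed here: "debmonitor").
ROW_FORMAT_QUERY = (
    "SELECT TABLE_NAME, ROW_FORMAT "
    "FROM information_schema.TABLES "
    "WHERE TABLE_SCHEMA = %s"
)

# Row formats that can use the large index prefix.
LARGE_PREFIX_FORMATS = {"dynamic", "compressed"}


def tables_without_large_prefix(rows):
    """Return names of tables whose row format cannot use large prefixes.

    `rows` is an iterable of (table_name, row_format) tuples, in the
    shape returned by ROW_FORMAT_QUERY.
    """
    return sorted(
        name
        for name, row_format in rows
        if (row_format or "").lower() not in LARGE_PREFIX_FORMATS
    )
```

Any table this returns would need its row format stated explicitly at creation time, as in the `create table test ... row_format=dynamic` statement above, since the global default cannot be set on the versions discussed.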
[15:04:43] RECOVERY - MariaDB Slave Lag: s3 on db2036 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [15:04:43] you are future proofing mysql [15:04:43] RECOVERY - MariaDB Slave Lag: s3 on db2094 is OK: OK slave_sql_lag Replication lag: 0.36 seconds [15:04:51] by running into 2 bugs already [15:04:52] 10Operations, 10ops-codfw, 10Cloud-VPS: move/setup/install labtestnet2002(WMF6469) - https://phabricator.wikimedia.org/T196000#4246140 (10RobH) a:05RobH>03Papaul @mark approved this allocation, go ahead and start the steps and move the system, thanks! [15:05:03] RECOVERY - MariaDB Slave Lag: s3 on db2050 is OK: OK slave_sql_lag Replication lag: 0.33 seconds [15:05:12] RECOVERY - MariaDB Slave Lag: s3 on db2043 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [15:05:16] allow me to attend to db2079 now [15:05:22] RECOVERY - MariaDB Slave Lag: s3 on db2074 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [15:05:22] as there is nothing else I can do on my side [15:05:39] sure take care of it [15:07:13] PROBLEM - SSH on labtestservices2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:12:52] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [15:14:41] 10Operations, 10Analytics, 10hardware-requests: Site: eqiad | Hardware refresh for analytics100[1,2] - https://phabricator.wikimedia.org/T196079#4246157 (10elukey) [15:15:57] (03CR) 10Jcrespo: [C: 04-1] "I can offer setting up a database m5-test, somewhere, for example, as an alternative, on codfw only." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429841 (https://phabricator.wikimedia.org/T192339) (owner: 10Andrew Bogott) [15:16:13] 10Operations, 10ops-ulsfo, 10Traffic, 10netops: troubleshoot cr3/cr4 link - https://phabricator.wikimedia.org/T196030#4246171 (10RobH) >>! In T196030#4245080, @ayounsi wrote: > > Probably still need to swap RX/TX. I replaced both of the optics with wholly different optics and a wholly different fiber cab...
[15:16:16] 10Operations, 10Analytics, 10hardware-requests: Site: eqiad | Hardware refresh for analytics100[1,2] - https://phabricator.wikimedia.org/T196079#4246157 (10elukey) [15:24:22] (03CR) 10Bstorm: [C: 032] wiki replicas: maintain-dbusers to skip offline labsdb servers [puppet] - 10https://gerrit.wikimedia.org/r/436353 (https://phabricator.wikimedia.org/T188681) (owner: 10Bstorm) [15:24:30] (03PS3) 10Bstorm: wiki replicas: maintain-dbusers to skip offline labsdb servers [puppet] - 10https://gerrit.wikimedia.org/r/436353 (https://phabricator.wikimedia.org/T188681) [15:25:57] 10Operations, 10Analytics, 10hardware-requests: Site: eqiad | hardware request for a dedicated stat analytics host for the Research team - https://phabricator.wikimedia.org/T196080#4246185 (10elukey) [15:29:03] jouncebot: next [15:29:03] In 0 hour(s) and 30 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180531T1600) [15:29:20] (03PS1) 10Andrew Bogott: keystonehooks: Add any new project member to bastion [puppet] - 10https://gerrit.wikimedia.org/r/436570 (https://phabricator.wikimedia.org/T165337) [15:30:10] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for librenms-syslog [puppet] - 10https://gerrit.wikimedia.org/r/436571 (https://phabricator.wikimedia.org/T135991) [15:30:20] (03PS2) 10Andrew Bogott: keystonehooks: Add any new project member to bastion [puppet] - 10https://gerrit.wikimedia.org/r/436570 (https://phabricator.wikimedia.org/T165337) [15:31:08] (03PS3) 10Andrew Bogott: keystonehooks: Add any new project member to bastion [puppet] - 10https://gerrit.wikimedia.org/r/436570 (https://phabricator.wikimedia.org/T165337) [15:31:15] !log reimage db2079 [15:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:29] RECOVERY - SSH on labtestservices2001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.10 (protocol 2.0) [15:40:38] PROBLEM - SSH on labtestservices2001 is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds [15:40:47] the quest for making that row_format happen is harder than I thought, but I have an idea, doing some test [15:40:53] (03CR) 10Andrew Bogott: "> I can offer setting up a database m5-test, somewhere, for example, as an alternative, on codfw only." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429841 (https://phabricator.wikimedia.org/T192339) (owner: 10Andrew Bogott) [15:41:39] jynus: thanks a lot for the insight and the fixes! I will also like your opinion on https://gerrit.wikimedia.org/r/#/c/436286/ when you've time ;) [15:42:14] (03PS2) 10Herron: standard::mail::sender: run a smtp daemon on localhost:25 [puppet] - 10https://gerrit.wikimedia.org/r/429456 (https://phabricator.wikimedia.org/T175361) [15:42:15] !log addshore@tin Synchronized php-1.32.0-wmf.6/extensions/WikibaseLexeme: [[gerrit:436566|Only add repo-specific entity type definition elements in Repo context]] T195615 (duration: 01m 32s) [15:42:15] andrewbogott: is labtestservices2001 unavailability you? [15:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:20] T195615: handle use of statements linking to Lexemes (and Forms?) more gracefully on client - https://phabricator.wikimedia.org/T195615 [15:43:04] chasemp: I'm not doing anything there, although that would explain why ldap was so slow for me just now... [15:43:13] I'll log on and see what's up. 
[15:43:27] andrewbogott: I recall this happend before and it was some load issue [15:43:31] but maybe I'm confused [15:43:40] It's certainly acting like a load issue [15:44:49] hm, chasemp, I don't think it's CPU load at least [15:44:56] !log addshore@tin Synchronized php-1.32.0-wmf.5/extensions/WikibaseLexeme: [[gerrit:436567|Only add repo-specific entity type definition elements in Repo context]] T195615 (duration: 01m 32s) [15:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:59] andrewbogott: agreed, nor IO [15:46:25] (03CR) 10星耀晨曦: [C: 04-1] Assign movefile to autoreviewrs and patrollers on zhwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436524 (https://phabricator.wikimedia.org/T195247) (owner: 10Urbanecm) [15:46:31] I'm getting the same super-slow-login issue on labtestcontrol2001 now [15:46:40] !log enabling localhost:25 exim smtp listeners in production realm T175361 [15:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:45] T175361: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361 [15:46:45] (03CR) 10Herron: [C: 032] standard::mail::sender: run a smtp daemon on localhost:25 [puppet] - 10https://gerrit.wikimedia.org/r/429456 (https://phabricator.wikimedia.org/T175361) (owner: 10Herron) [15:46:47] (03CR) 10Addshore: [C: 032] Load WikibaseLexeme on testwiki (again) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436560 (owner: 10Addshore) [15:47:19] (03PS1) 10Volans: Revert "Reducing max length for varchar columns" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/436573 [15:47:56] (03Merged) 10jenkins-bot: Load WikibaseLexeme on testwiki (again) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436560 (owner: 10Addshore) [15:49:12] (03CR) 10jenkins-bot: Load WikibaseLexeme on testwiki (again) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436560 (owner: 10Addshore) [15:50:08] 10Operations, 10Puppet, 
10Beta-Cluster-Infrastructure, 10Performance-Team, 10Patch-For-Review: Include role/common in beta-cluster hieradata hierarchy - https://phabricator.wikimedia.org/T196034#4246255 (10Krinkle) [15:50:19] so andrewbogott atm I don't see it there and I'm unsure about labtestservices2001 [15:50:25] ok [15:50:36] I don't know what's up with labtestservices2001 either. It's definitely acting poorly [15:50:39] Tempted to just reboot it [15:50:54] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: [[gerrit:436560|Load WikibaseLexeme on testwiki (again)]] T195615 (duration: 01m 21s) [15:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:58] T195615: handle use of statements linking to Lexemes (and Forms?) more gracefully on client - https://phabricator.wikimedia.org/T195615 [15:51:00] we've seen this w/ syslog issues where it behaves as if it's load but load isn't actually high when OOM or something hits syslog [15:51:02] just a guess [15:51:21] andrewbogott: sure, I'm for it [15:51:27] (03CR) 10Addshore: [C: 032] Load WikibaseLexeme on all of group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436497 (https://phabricator.wikimedia.org/T195615) (owner: 10Addshore) [15:51:31] ok, will downtime first [15:51:46] (I have to go deal with a contractor in a minute) [15:52:50] !log rebooting labtestservices2001 to troubleshoot unknown load problems [15:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:59] (03Merged) 10jenkins-bot: Load WikibaseLexeme on all of group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436497 (https://phabricator.wikimedia.org/T195615) (owner: 10Addshore) [15:54:28] (03CR) 10jenkins-bot: Load WikibaseLexeme on all of group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436497 (https://phabricator.wikimedia.org/T195615) (owner: 10Addshore) [15:57:03] RECOVERY - SSH on labtestservices2001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.10 
(protocol 2.0) [15:58:22] RECOVERY - MariaDB Slave SQL: s8 on db2079 is OK: OK slave_sql_state Slave_SQL_Running: Yes [15:58:53] RECOVERY - MariaDB Slave IO: s8 on db2079 is OK: OK slave_io_state Slave_IO_Running: Yes [15:59:00] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: [[gerrit:436497|Load WikibaseLexeme on group0]] T195615 (duration: 01m 18s) [15:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:04] T195615: handle use of statements linking to Lexemes (and Forms?) more gracefully on client - https://phabricator.wikimedia.org/T195615 [15:59:11] 10Puppet, 10Beta-Cluster-Infrastructure, 10Cloud-Services, 10Release-Engineering-Team (Next), 10User-Joe: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675#3139285 (10Krinkle) @Joe I support the idea of not allowing sharing of role-related hieradata between prod a... [15:59:12] (03Abandoned) 10Krinkle: deployment-prep: Remove override for scap::sources [puppet] - 10https://gerrit.wikimedia.org/r/436439 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [15:59:21] (03Abandoned) 10Krinkle: puppetmaster: Add role_hierarchy to labs.hiera [puppet] - 10https://gerrit.wikimedia.org/r/436440 (https://phabricator.wikimedia.org/T196034) (owner: 10Krinkle) [16:00:04] godog, moritzm, and _joe_: #bothumor I � Unicode. All rise for Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180531T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:14] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure, 10Performance-Team: Include role/common in beta-cluster hieradata hierarchy - https://phabricator.wikimedia.org/T196034#4246279 (10Krinkle) [16:00:42] (03PS1) 10Ppchelko: Remove unused jobrunners. 
[puppet] - 10https://gerrit.wikimedia.org/r/436574 (https://phabricator.wikimedia.org/T190327) [16:00:55] (03PS1) 10Addshore: Revert "Load WikibaseLexeme on all of group0" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436575 [16:01:06] (03CR) 10Addshore: [V: 032 C: 032] Revert "Load WikibaseLexeme on all of group0" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436575 (owner: 10Addshore) [16:01:16] reverting as it is causing errors on api.php on mw.org [16:01:21] (03CR) 10jenkins-bot: Revert "Load WikibaseLexeme on all of group0" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436575 (owner: 10Addshore) [16:03:06] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: REVERT Load WikibaseLexeme on group0 T195615 (duration: 01m 21s) [16:03:09] {{done}} [16:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:16] andrewbogott: seems better to me when it came back but that's pretty odd [16:03:25] 10Operations, 10ops-codfw, 10Cloud-VPS: move/setup/install labtestnet2002(WMF6469) - https://phabricator.wikimedia.org/T196000#4246294 (10Papaul) We have already a hosts name labtestnet2002 [16:03:31] beh, https://test.wikipedia.org/w/api.php also errors, so I'll revert that one too [16:03:49] (03PS1) 10Addshore: Revert "Load WikibaseLexeme on testwiki (again)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436577 [16:03:54] (03CR) 10Addshore: [C: 032] Revert "Load WikibaseLexeme on testwiki (again)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436577 (owner: 10Addshore) [16:04:29] 10Operations, 10ops-codfw, 10Cloud-VPS: move/setup/install labtestnet2002(WMF6469) - https://phabricator.wikimedia.org/T196000#4246297 (10RobH) My bad, updating task and lets call this labtestnet2003. This is to replace labtestnet2001, and I assumed it would be 2002, my bad! 
[16:05:01] (03Merged) 10jenkins-bot: Revert "Load WikibaseLexeme on testwiki (again)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436577 (owner: 10Addshore) [16:05:04] 10Operations, 10ops-codfw, 10Cloud-VPS: move/setup/install labtestnet2003(WMF6469) - https://phabricator.wikimedia.org/T196000#4246298 (10RobH) [16:05:30] (03CR) 10Addshore: [C: 04-2] "not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436498 (https://phabricator.wikimedia.org/T195615) (owner: 10Addshore) [16:05:36] (03CR) 10Addshore: [C: 04-2] "not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436499 (https://phabricator.wikimedia.org/T195615) (owner: 10Addshore) [16:06:37] (03CR) 10jenkins-bot: Revert "Load WikibaseLexeme on testwiki (again)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436577 (owner: 10Addshore) [16:06:44] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: REVERT Load WikibaseLexeme on testwiki (again) T195615 (duration: 01m 21s) [16:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:49] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure, 10Performance-Team: Define scap::sources in a way that is shared between prod and beta - https://phabricator.wikimedia.org/T196034#4246302 (10Krinkle) [16:06:49] T195615: handle use of statements linking to Lexemes (and Forms?) 
more gracefully on client - https://phabricator.wikimedia.org/T195615 [16:06:58] (03PS1) 10ArielGlenn: use 6 parallel jobs for xml/sql dumps of big wikis [puppet] - 10https://gerrit.wikimedia.org/r/436578 [16:07:14] !log WikibaseLexeme slot done (7 min overrun) [16:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:27] Yeah, seems better [16:07:43] 10Operations, 10ops-codfw, 10Cloud-VPS: move/setup/install labtestnet2003(WMF6469) - https://phabricator.wikimedia.org/T196000#4246305 (10Papaul) moved wmf6469 from D5 to B1 new Switch port information ge-1/0/15 old switch port information ge-5/0/16 [16:08:06] 10Operations, 10ops-codfw, 10Cloud-VPS: move/setup/install labtestnet2003(WMF6469) - https://phabricator.wikimedia.org/T196000#4246306 (10Papaul) [16:08:13] (03CR) 10Jcrespo: [C: 04-1] "So, I don't NEED that I manage that. I am offering you that, IF YOU WANT. My only requirement is that it is either separate from mediawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429841 (https://phabricator.wikimedia.org/T192339) (owner: 10Andrew Bogott) [16:16:32] RECOVERY - MariaDB Slave Lag: s8 on db2079 is OK: OK slave_sql_lag Replication lag: 19.16 seconds [16:19:47] ^that should fix db2079 [16:24:42] (03PS1) 10Papaul: DNS: Add mgmt DNS entries for labtestnet2003 [dns] - 10https://gerrit.wikimedia.org/r/436579 (https://phabricator.wikimedia.org/T196000) [16:25:24] (03PS2) 10Muehlenhoff: Remove now obsolete os conditional [puppet] - 10https://gerrit.wikimedia.org/r/436241 [16:28:34] ACKNOWLEDGEMENT - puppet last run on kubestage1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 23 minutes ago with 1 failures. Failed resources (up to 3 shown): Physical_volume[/dev/md1] alexandros kosiaris box is to be reimaged, ignore this [16:28:47] ACKNOWLEDGEMENT - puppet last run on kubernetes1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 24 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Physical_volume[/dev/md2] alexandros kosiaris box is to be reimaged, ignore this [16:29:31] 10Operations, 10Cassandra, 10Discovery, 10Maps: cassandra 2.2.6-wmf4 is not compatible with python 2.7.13 (debian stretch) - https://phabricator.wikimedia.org/T196044#4246345 (10MoritzMuehlenhoff) p:05Triage>03Normal [16:29:41] (03PS1) 10Dzahn: planet: rm planet-venus feed templates, rename feeds_rawdog to feeds [puppet] - 10https://gerrit.wikimedia.org/r/436580 (https://phabricator.wikimedia.org/T180498) [16:30:55] (03PS1) 10Krinkle: Move scap::sources from role::deployment_server to common [puppet] - 10https://gerrit.wikimedia.org/r/436581 (https://phabricator.wikimedia.org/T161675) [16:31:13] (03CR) 10jerkins-bot: [V: 04-1] Move scap::sources from role::deployment_server to common [puppet] - 10https://gerrit.wikimedia.org/r/436581 (https://phabricator.wikimedia.org/T161675) (owner: 10Krinkle) [16:32:34] 10Operations, 10fundraising-tech-ops: Long term storage for frack prometheus data - https://phabricator.wikimedia.org/T175738#4246366 (10RobH) [16:33:16] (03PS2) 10Krinkle: Move scap::sources from role::deployment_server to common [puppet] - 10https://gerrit.wikimedia.org/r/436581 (https://phabricator.wikimedia.org/T161675) [16:33:54] 10Operations, 10Analytics, 10hardware-requests: Site: eqiad | hardware request for a dedicated stat analytics host for the Research team - https://phabricator.wikimedia.org/T196080#4246388 (10Nuria) p:05Triage>03Normal [16:34:23] (03CR) 10ArielGlenn: [C: 031] "yay!" [puppet] - 10https://gerrit.wikimedia.org/r/436241 (owner: 10Muehlenhoff) [16:34:32] 10Operations, 10Analytics, 10hardware-requests: Site: eqiad | Hardware refresh for analytics100[1,2] - https://phabricator.wikimedia.org/T196079#4246394 (10Nuria) p:05Triage>03Normal [16:35:30] (03CR) 10Paladox: [C: 031] "LGTM (reviewed all the files) Probaly want to do a puppet compiler in case?" 
[puppet] - 10https://gerrit.wikimedia.org/r/436580 (https://phabricator.wikimedia.org/T180498) (owner: 10Dzahn) [16:36:36] (03PS2) 10ArielGlenn: use 6 parallel jobs for xml/sql dumps of big wikis [puppet] - 10https://gerrit.wikimedia.org/r/436578 [16:37:22] 10Operations, 10Analytics, 10DC-Ops, 10procurement: Analytics hosts missing in Inventory/Refresh - https://phabricator.wikimedia.org/T196072#4246405 (10Nuria) p:05Triage>03Normal [16:37:27] 10Operations, 10Analytics, 10DC-Ops, 10procurement: Analytics hosts missing in Inventory/Refresh - https://phabricator.wikimedia.org/T196072#4245913 (10Nuria) p:05Normal>03Low [16:37:50] (03PS1) 10Dzahn: planet: move plugin dir out of feeds dir [puppet] - 10https://gerrit.wikimedia.org/r/436583 [16:38:10] (03CR) 10Arturo Borrero Gonzalez: [C: 031] "I reviewed the patch and the setup. I suggested Chase to also update novaobserver.yaml in the same run." [puppet] - 10https://gerrit.wikimedia.org/r/433734 (https://phabricator.wikimedia.org/T167559) (owner: 10Rush) [16:38:18] 10Operations, 10ops-eqiad, 10User-Eevans: Degraded RAID on restbase-dev1006 - https://phabricator.wikimedia.org/T185494#4246430 (10RobH) [16:38:23] (03CR) 10Dzahn: "wait about a week - then reinstall planet 1001 - then merge this" [puppet] - 10https://gerrit.wikimedia.org/r/436580 (https://phabricator.wikimedia.org/T180498) (owner: 10Dzahn) [16:39:00] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: move/setup/install labtestnet2003(WMF6469) - https://phabricator.wikimedia.org/T196000#4246443 (10Papaul) [16:39:22] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: move/setup/install labtestnet2003(WMF6469) - https://phabricator.wikimedia.org/T196000#4243908 (10Papaul) a:05Papaul>03RobH [16:40:11] 10Operations, 10Cassandra, 10Services (blocked), 10User-Eevans, 10User-fgiunchedi: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822#4246462 (10RobH) [16:40:20] 10Operations, 10ops-eqiad, 
10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4246465 (10RobH) [16:43:21] 10Operations, 10Analytics, 10Analytics-Wikistats, 10Domains, 10Traffic: HTTP 500 on stats.wikipedia.org (invalid domain) - https://phabricator.wikimedia.org/T195568#4231062 (10Nuria) Option b) sounds good. [16:44:07] (03Draft1) 10Paladox: Planet: Update https://stuwest.org/category/wiki/feed/ to https://stu.blog/category/wiki/feed/ [puppet] - 10https://gerrit.wikimedia.org/r/436584 [16:44:11] (03PS2) 10Paladox: Planet: Update https://stuwest.org/category/wiki/feed/ to https://stu.blog/category/wiki/feed/ [puppet] - 10https://gerrit.wikimedia.org/r/436584 [16:44:27] (03PS3) 10Paladox: Planet: rawdog rss url for en [puppet] - 10https://gerrit.wikimedia.org/r/436584 [16:46:30] 10Puppet, 10Analytics, 10Analytics-Kanban, 10Beta-Cluster-Infrastructure: deployment-eventlog05 puppet error about missing mysql heartbeat.heartbeat table - https://phabricator.wikimedia.org/T191109#4093870 (10Nuria) a:03elukey [16:50:31] (03PS1) 10Krinkle: deployment-prep: add webperf to scap::dsh::groups [puppet] - 10https://gerrit.wikimedia.org/r/436586 (https://phabricator.wikimedia.org/T195314) [16:51:44] (03CR) 10Krinkle: "Confirmed on Beta. Works fine." [puppet] - 10https://gerrit.wikimedia.org/r/436435 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [16:52:00] (03CR) 10Volans: [C: 032] Revert "Reducing max length for varchar columns" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/436573 (owner: 10Volans) [16:52:03] (03CR) 10Krinkle: "Currently testing on Beta Cluster via deployment-puppetmaster." [puppet] - 10https://gerrit.wikimedia.org/r/436581 (https://phabricator.wikimedia.org/T161675) (owner: 10Krinkle) [16:52:07] (03CR) 10Krinkle: "Currently testing on Beta Cluster via deployment-puppetmaster." 
[puppet] - 10https://gerrit.wikimedia.org/r/436586 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [16:52:48] 10Operations, 10Analytics, 10Analytics-Wikistats, 10Domains, 10Traffic: HTTP 404 on stats.wikipedia.org (Domain not served) - https://phabricator.wikimedia.org/T195568#4246485 (10Krinkle) [16:53:22] (03Merged) 10jenkins-bot: Revert "Reducing max length for varchar columns" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/436573 (owner: 10Volans) [16:53:44] 10Operations, 10Analytics, 10Analytics-Wikistats, 10Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281#4246500 (10Krinkle) 05declined>03Open [16:53:53] (03PS2) 10Urbanecm: Assign movefile to autoreviewrs and patrollers on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436524 (https://phabricator.wikimedia.org/T195247) [16:54:15] 10Operations, 10Analytics, 10Analytics-Wikistats, 10Traffic, 10Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281#2010000 (10Krinkle) [16:54:19] 10Operations, 10Analytics, 10Analytics-Wikistats, 10Domains, 10Traffic: HTTP 404 on stats.wikipedia.org (Domain not served) - https://phabricator.wikimedia.org/T195568#4231062 (10Krinkle) [16:54:27] 10Operations, 10Analytics, 10Analytics-Wikistats, 10Traffic, 10Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281#2010000 (10Krinkle) [16:55:25] 10Operations, 10Analytics, 10Analytics-Wikistats, 10Traffic, 10Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281#2010000 (10Krinkle) >>! In T195568#4233129, @Dzahn wrote: > option a) delete stats record from the wi... 
[16:57:06] 10Operations, 10Datasets-General-or-Unknown, 10Dumps-Generation, 10hardware-requests, 10Patch-For-Review: Give misc dump crons their own host - https://phabricator.wikimedia.org/T181936#4246516 (10RobH) [16:57:13] (03PS4) 10Volans: MySQL config fine-tuning [software/debmonitor] - 10https://gerrit.wikimedia.org/r/436244 (https://phabricator.wikimedia.org/T191299) [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180531T1700). [17:03:04] (03PS6) 10Herron: icinga-sms: use localhost as smtp server [puppet] - 10https://gerrit.wikimedia.org/r/429344 (https://phabricator.wikimedia.org/T82937) [17:03:11] (03CR) 10Volans: [C: 032] MySQL config fine-tuning [software/debmonitor] - 10https://gerrit.wikimedia.org/r/436244 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [17:03:57] (03PS2) 10Dzahn: planet: rm planet-venus feed templates, rename feeds_rawdog to feeds [puppet] - 10https://gerrit.wikimedia.org/r/436580 (https://phabricator.wikimedia.org/T180498) [17:03:59] (03PS2) 10Dzahn: planet: move plugin dir out of feeds dir [puppet] - 10https://gerrit.wikimedia.org/r/436583 [17:04:01] (03PS1) 10Dzahn: planet: remove jessie support and venus references [puppet] - 10https://gerrit.wikimedia.org/r/436589 (https://phabricator.wikimedia.org/T180498) [17:04:19] (03CR) 10Herron: [C: 032] icinga-sms: use localhost as smtp server [puppet] - 10https://gerrit.wikimedia.org/r/429344 (https://phabricator.wikimedia.org/T82937) (owner: 10Herron) [17:04:25] (03Merged) 10jenkins-bot: MySQL config fine-tuning [software/debmonitor] - 10https://gerrit.wikimedia.org/r/436244 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [17:05:28] (03PS1) 10Volans: Create a custom mysql backend and use it [software/debmonitor] - 10https://gerrit.wikimedia.org/r/436592 [17:05:42] 
(03CR) 10Paladox: [C: 031] planet: move plugin dir out of feeds dir [puppet] - 10https://gerrit.wikimedia.org/r/436583 (owner: 10Dzahn) [17:05:44] (03PS4) 10Dzahn: Planet: rawdog rss url for en [puppet] - 10https://gerrit.wikimedia.org/r/436584 (owner: 10Paladox) [17:06:25] (03CR) 10Dzahn: [C: 032] Planet: rawdog rss url for en [puppet] - 10https://gerrit.wikimedia.org/r/436584 (owner: 10Paladox) [17:07:16] (03CR) 10Paladox: [C: 031] planet: remove jessie support and venus references [puppet] - 10https://gerrit.wikimedia.org/r/436589 (https://phabricator.wikimedia.org/T180498) (owner: 10Dzahn) [17:08:02] 10Operations, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): elastic2018 not rebooting - https://phabricator.wikimedia.org/T196045#4246569 (10Gehel) It looks like this worked, elastic2018 looks good again. @Papaul is there any follow up we should do on that? Otherwise, feel free to close the ta... [17:08:19] (03CR) 10Dzahn: "deployment-tin should be gone. deployment-deploy1001 has been deleted. deployment-deploy-01 is the latest thing and using stretch now" [puppet] - 10https://gerrit.wikimedia.org/r/436284 (owner: 10Muehlenhoff) [17:11:16] hashar: are you around? [17:11:31] mutante: dinner dinner :] [17:11:51] ok, no worries. enjoy dinner [17:12:05] mutante: but shout and I will catch up once done [17:12:25] i want you to run puppet on ci::labs after i merge your change. 
that's all [17:12:42] if you say "already cherry-picked" then i just do it :) [17:12:59] gitcache change [17:13:28] (03PS2) 10Dzahn: ci: add VisualEditor and Wikibase to git cache [puppet] - 10https://gerrit.wikimedia.org/r/436512 (owner: 10Hashar) [17:16:12] PROBLEM - MariaDB Slave Lag: s3 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.94 seconds [17:16:12] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.11 seconds [17:16:22] PROBLEM - MariaDB Slave Lag: s3 on db2036 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.05 seconds [17:16:43] PROBLEM - MariaDB Slave Lag: s3 on db2050 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.47 seconds [17:16:52] PROBLEM - MariaDB Slave Lag: s3 on db2043 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.21 seconds [17:17:03] PROBLEM - MariaDB Slave Lag: s3 on db2074 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.09 seconds [17:17:54] (03PS2) 10Volans: Create a custom mysql backend and use it [software/debmonitor] - 10https://gerrit.wikimedia.org/r/436592 [17:19:22] Those criticals are coming from the maintenance script that is being run [17:20:06] good to know,thx [17:26:56] (03PS3) 10Volans: Create a custom mysql backend and use it [software/debmonitor] - 10https://gerrit.wikimedia.org/r/436592 [17:27:54] 10Operations, 10Cassandra, 10Discovery, 10Maps: cassandra 2.2.6-wmf4 is not compatible with python 2.7.13 (debian stretch) - https://phabricator.wikimedia.org/T196044#4246652 (10Eevans) OK, I've pushed 2.2.6-wmf5 to http://people.wikimedia.org/~eevans (signed with key ID 8D77295D). It [[ https://github.co... 
[17:29:07] (03PS1) 10Pmiazga: Beta: Enable PP for newly created accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436596 (https://phabricator.wikimedia.org/T191888) [17:29:33] (03PS2) 10Pmiazga: beta: Enable PP for newly created accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436596 (https://phabricator.wikimedia.org/T191888) [17:29:36] (03CR) 10星耀晨曦: [C: 031] Assign movefile to autoreviewrs and patrollers on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436524 (https://phabricator.wikimedia.org/T195247) (owner: 10Urbanecm) [17:31:17] (03CR) 10Pmiazga: [C: 032] beta: Enable PP for newly created accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436596 (https://phabricator.wikimedia.org/T191888) (owner: 10Pmiazga) [17:32:24] (03Merged) 10jenkins-bot: beta: Enable PP for newly created accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436596 (https://phabricator.wikimedia.org/T191888) (owner: 10Pmiazga) [17:32:38] (03CR) 10jenkins-bot: beta: Enable PP for newly created accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436596 (https://phabricator.wikimedia.org/T191888) (owner: 10Pmiazga) [17:33:41] hey, quick question, if I merge a beta cluster config change [17:33:53] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:33:59] when it hits beta cluster? [17:34:16] (03PS1) 10Alex Monk: Kill the last role::puppet::self references [puppet] - 10https://gerrit.wikimedia.org/r/436600 (https://phabricator.wikimedia.org/T187622) [17:37:38] (03PS1) 10Krinkle: webperf: Add navtiming and coal to scap::sources [puppet] - 10https://gerrit.wikimedia.org/r/436601 (https://phabricator.wikimedia.org/T195314) [17:38:35] (03CR) 10Krinkle: "Checked-picked to beta's puppetmaster." 
[puppet] - 10https://gerrit.wikimedia.org/r/436601 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [17:39:43] raynor: Within 5min usually. [17:40:22] Thx Krinkle, I started checking everything and I just found the job in jenkins [17:41:49] (03CR) 10Andrew Bogott: [C: 032] Kill the last role::puppet::self references [puppet] - 10https://gerrit.wikimedia.org/r/436600 (https://phabricator.wikimedia.org/T187622) (owner: 10Alex Monk) [17:42:27] 10Puppet, 10Patch-For-Review, 10cloud-services-team (Kanban): Remove role::puppet::self and related support code - https://phabricator.wikimedia.org/T182810#4246701 (10Krenair) [17:42:31] 10Puppet, 10Patch-For-Review, 10cloud-services-team (Kanban): role::puppet::self referenced in puppet_ssldir.rb - https://phabricator.wikimedia.org/T187622#4246700 (10Krenair) 05Open>03Resolved [17:49:46] mutante, any idea what the plan is for icinga in prod? [17:58:46] addshore should we have the mediawiki-docker repo moved to gerrit and then mirrored to github? [18:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180531T1800). [18:00:05] No GERRIT patches in the queue for this window AFAICS. 
[18:00:15] davidwbarratt: no preference really from me [18:00:48] we have the wikibase-docker one also on github, i was thinking about moving it to gerrit, but then our CI is all setup on travis [18:01:18] it isn't part of the docker library though) just under the wikibase org [18:02:23] hmm [18:03:55] davidwbarratt: for the wikibase ones now we are also having a bundle image with more etensions https://github.com/wmde/wikibase-docker/blob/master/wikibase/README.md fyi [18:04:12] PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 391.21 seconds [18:06:42] RECOVERY - MariaDB Slave Lag: s3 on db2050 is OK: OK slave_sql_lag Replication lag: 11.91 seconds [18:06:52] RECOVERY - MariaDB Slave Lag: s3 on db2043 is OK: OK slave_sql_lag Replication lag: 0.26 seconds [18:07:03] RECOVERY - MariaDB Slave Lag: s3 on db2074 is OK: OK slave_sql_lag Replication lag: 0.44 seconds [18:07:12] RECOVERY - MariaDB Slave Lag: s3 on db2094 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:07:13] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:07:23] RECOVERY - MariaDB Slave Lag: s3 on db2036 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:08:08] mutante: ah yeah i gotta cherry pick that is true. Doing so now [18:10:29] (03CR) 10Hashar: "I have forgot to cherry-pick / test this patch. Did it a minute ago and I ran puppet on integration-slave-docker-1003 but it is broken som" [puppet] - 10https://gerrit.wikimedia.org/r/436512 (owner: 10Hashar) [18:13:12] (03PS4) 10Paladox: gerrit: Ajust scap files (DO NOT MERGE) [software/gerrit] (stable-2.14) - 10https://gerrit.wikimedia.org/r/414763 [18:14:45] (03PS1) 10Hashar: Revert "Kill the last role::puppet::self references" [puppet] - 10https://gerrit.wikimedia.org/r/436605 (https://phabricator.wikimedia.org/T187622) [18:16:07] (03CR) 10Hashar: "Applied on integration-puppetmaster and deployment-puppetmaster03." 
[puppet] - 10https://gerrit.wikimedia.org/r/436605 (https://phabricator.wikimedia.org/T187622) (owner: 10Hashar) [18:17:25] (03CR) 10Hashar: [C: 031] "Works!" [puppet] - 10https://gerrit.wikimedia.org/r/436512 (owner: 10Hashar) [18:17:44] mutante: https://gerrit.wikimedia.org/r/#/c/436512/ works [18:20:36] (03PS1) 10Alex Monk: puppet_ssldir: Fix to reintroduce sneaky check we just accidentally removed [puppet] - 10https://gerrit.wikimedia.org/r/436606 [18:22:17] (03CR) 10Andrew Bogott: [V: 032 C: 032] puppet_ssldir: Fix to reintroduce sneaky check we just accidentally removed [puppet] - 10https://gerrit.wikimedia.org/r/436606 (owner: 10Alex Monk) [18:28:25] (03PS1) 10Chad: WIP: 2.15.1 branch for wikimedia [software/gerrit] (stable-2.15) - 10https://gerrit.wikimedia.org/r/436607 [18:34:31] (03Abandoned) 10Hashar: Revert "Kill the last role::puppet::self references" [puppet] - 10https://gerrit.wikimedia.org/r/436605 (https://phabricator.wikimedia.org/T187622) (owner: 10Hashar) [18:38:51] (03CR) 10Krinkle: "This seems to work in that the key is read from common.yaml (Yay), but it fails because two of the entries here refer to an $lvs_service k" [puppet] - 10https://gerrit.wikimedia.org/r/436581 (https://phabricator.wikimedia.org/T161675) (owner: 10Krinkle) [18:51:12] PROBLEM - Disk space on elastic1026 is CRITICAL: DISK CRITICAL - free space: /srv 59121 MB (12% inode=99%) [18:54:22] PROBLEM - Disk space on elastic1026 is CRITICAL: DISK CRITICAL - free space: /srv 61750 MB (12% inode=99%) [18:58:44] (03PS3) 10ArielGlenn: use 6 parallel jobs for xml/sql dumps of big wikis [puppet] - 10https://gerrit.wikimedia.org/r/436578 [18:59:48] (03CR) 10ArielGlenn: [C: 032] use 6 parallel jobs for xml/sql dumps of big wikis [puppet] - 10https://gerrit.wikimedia.org/r/436578 (owner: 10ArielGlenn) [19:00:04] thcipriani: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180531T1900). [19:00:22] * thcipriani trains [19:01:11] (03CR) 10Aaron Schulz: profile::mediawiki::mcrouter_wancache: update the ssl paths (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/436531 (https://phabricator.wikimedia.org/T192771) (owner: 10Giuseppe Lavagetto) [19:02:32] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 40 probes of 321 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [19:02:41] thcipriani: choooo choooooo [19:03:14] (03CR) 10ArielGlenn: [C: 032] split up dumps temp dir into subdirs [dumps] - 10https://gerrit.wikimedia.org/r/434461 (https://phabricator.wikimedia.org/T182572) (owner: 10ArielGlenn) [19:03:50] addshore: each train for which I am responsible I take a lap around my house shouting exactly that. [19:04:24] !log ariel@tin Started deploy [dumps/dumps@038c8b3]: tempdir split into subdirs [19:04:28] !log ariel@tin Finished deploy [dumps/dumps@038c8b3]: tempdir split into subdirs (duration: 00m 04s) [19:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:16] !log mholloway-shell@tin Started deploy [tilerator/deploy@UNKNOWN] (cleartables): (no justification provided) [19:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:21] !log mholloway-shell@tin Finished deploy [tilerator/deploy@UNKNOWN] (cleartables): (no justification provided) (duration: 00m 06s) [19:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:13] RECOVERY - Disk space on elastic1026 is OK: DISK OK [19:16:33] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 22 probes of 301 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [19:16:40] thcipriani: i have a great image in my head right now [19:16:49] thank 
you for that :) [19:17:35] :D [19:17:37] no charge. [19:19:02] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [19:22:13] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:25:16] !log mholloway-shell@tin Started deploy [tilerator/deploy@UNKNOWN] (cleartables): (no justification provided) [19:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:21] !log mholloway-shell@tin Finished deploy [tilerator/deploy@UNKNOWN] (cleartables): (no justification provided) (duration: 00m 05s) [19:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:42] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 16 probes of 301 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [19:27:47] !log mholloway-shell@tin Started deploy [tilerator/deploy@UNKNOWN] (cleartables): (no justification provided) [19:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:16] !log mholloway-shell@tin Started deploy [tilerator/deploy@UNKNOWN] (cleartables): (no justification provided) [19:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:22] !log mholloway-shell@tin Finished deploy [tilerator/deploy@UNKNOWN] (cleartables): (no justification provided) (duration: 00m 06s) [19:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:43] !log mholloway-shell@tin Started deploy [tilerator/deploy@UNKNOWN] (cleartables): (no justification provided) [19:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:43] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 3 probes of 321 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [19:33:44] (03PS1) 10Thcipriani: All wikis to 1.32.0-wmf.6 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/436618 (https://phabricator.wikimedia.org/T191052) [19:34:23] (03CR) 10Krinkle: Move scap::sources from role::deployment_server to common (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/436581 (https://phabricator.wikimedia.org/T161675) (owner: 10Krinkle) [19:35:36] (03CR) 10Thcipriani: [C: 032] All wikis to 1.32.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436618 (https://phabricator.wikimedia.org/T191052) (owner: 10Thcipriani) [19:37:17] (03Merged) 10jenkins-bot: All wikis to 1.32.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436618 (https://phabricator.wikimedia.org/T191052) (owner: 10Thcipriani) [19:37:28] (03PS1) 10Chad: Merge tag 'v2.15.2' into wmf/stable-2.15 [software/gerrit/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/436619 [19:37:40] (03PS2) 10Krinkle: webperf: Add navtiming and coal to scap::sources [puppet] - 10https://gerrit.wikimedia.org/r/436601 (https://phabricator.wikimedia.org/T195314) [19:38:05] (03PS3) 10Krinkle: webperf: Add navtiming and coal to scap::sources [puppet] - 10https://gerrit.wikimedia.org/r/436601 (https://phabricator.wikimedia.org/T195314) [19:38:39] (03PS4) 10Krinkle: webperf: Add navtiming and coal to scap::sources [puppet] - 10https://gerrit.wikimedia.org/r/436601 (https://phabricator.wikimedia.org/T195314) [19:38:48] (03CR) 10jenkins-bot: All wikis to 1.32.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436618 (https://phabricator.wikimedia.org/T191052) (owner: 10Thcipriani) [19:39:53] (03CR) 10Paladox: [C: 031] Merge tag 'v2.15.2' into wmf/stable-2.15 [software/gerrit/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/436619 (owner: 10Chad) [19:40:04] !log mholloway-shell@tin Started deploy [tilerator/deploy@2a26f1e] (cleartables): (no justification provided) [19:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:33] !log mholloway-shell@tin Started 
deploy [tilerator/deploy@2a26f1e] (cleartables): (no justification provided) [19:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:37] !log thcipriani@tin rebuilt and synchronized wikiversions files: all wikis to 1.32.0-wmf.6 [19:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:52] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [19:52:03] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:04:12] (03PS3) 10Dzahn: ci: add VisualEditor and Wikibase to git cache [puppet] - 10https://gerrit.wikimedia.org/r/436512 (owner: 10Hashar) [20:06:07] (03PS1) 10Alex Monk: ssh known_hosts: sort resources by certname [puppet] - 10https://gerrit.wikimedia.org/r/436624 [20:06:10] (03CR) 10BryanDavis: Read command line arguments from a config file (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/435691 (https://phabricator.wikimedia.org/T148872) (owner: 10Nehajha) [20:10:14] (03PS1) 10Herron: exim minimal: allow from local host interface addresses in rcpt acl [puppet] - 10https://gerrit.wikimedia.org/r/436626 (https://phabricator.wikimedia.org/T175361) [20:11:01] 10Puppet, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: Set up puppet exported resources to collect ssh host keys for beta - https://phabricator.wikimedia.org/T72792#4247163 (10Krenair) It's working but it's very loud - I've made https://gerrit.wikimedia.org/r/#/c/436624/ to deal with that Also probably... 
[20:11:16] (03PS2) 10Herron: exim minimal: allow from local host interface addresses in rcpt acl [puppet] - 10https://gerrit.wikimedia.org/r/436626 (https://phabricator.wikimedia.org/T175361) [20:12:41] (03CR) 10Herron: [C: 032] exim minimal: allow from local host interface addresses in rcpt acl [puppet] - 10https://gerrit.wikimedia.org/r/436626 (https://phabricator.wikimedia.org/T175361) (owner: 10Herron) [20:13:45] (03CR) 10Alex Monk: "I suppose you could argue we should be doing this at the end of the query_resources definition in puppetdbquery. Thoughts?" [puppet] - 10https://gerrit.wikimedia.org/r/436624 (owner: 10Alex Monk) [20:16:30] (03CR) 10Bstorm: "puppetdbquery is clearly an open source module from the 'net that is copied in. I don't know if we care about it matching the upstream?" [puppet] - 10https://gerrit.wikimedia.org/r/436624 (owner: 10Alex Monk) [20:19:57] (03CR) 10Bstorm: [C: 031] "I like the solution, but this is going to run literally everywhere, so I'm curious if anyone else knows how this could go wrong if the sor" [puppet] - 10https://gerrit.wikimedia.org/r/436624 (owner: 10Alex Monk) [20:21:22] RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [20:21:46] (03PS1) 10Dzahn: assign 10.64.16.18 to phab1002.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/436631 [20:24:20] (03CR) 10Krinkle: "https://puppet-compiler.wmflabs.org/compiler02/11330/" [puppet] - 10https://gerrit.wikimedia.org/r/436601 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [20:25:13] (03CR) 10Dzahn: [C: 032] "unused IP, .8 = phab1001 .18 = phab1002" [dns] - 10https://gerrit.wikimedia.org/r/436631 (owner: 10Dzahn) [20:27:03] 10Operations, 10ops-eqiad, 10Patch-For-Review: setup/install phab1002(WMF4727) - https://phabricator.wikimedia.org/T196019#4247215 (10Dzahn) [20:29:58] (03CR) 10BryanDavis: "I thought we had already fixed this at some point, but apparently not." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/436570 (https://phabricator.wikimedia.org/T165337) (owner: 10Andrew Bogott) [20:38:04] (03PS4) 10Andrew Bogott: keystonehooks: Add any new project member to bastion [puppet] - 10https://gerrit.wikimedia.org/r/436570 (https://phabricator.wikimedia.org/T165337) [20:38:38] (03CR) 10jerkins-bot: [V: 04-1] keystonehooks: Add any new project member to bastion [puppet] - 10https://gerrit.wikimedia.org/r/436570 (https://phabricator.wikimedia.org/T165337) (owner: 10Andrew Bogott) [20:43:02] (03PS5) 10Andrew Bogott: keystonehooks: Add any new project member to bastion [puppet] - 10https://gerrit.wikimedia.org/r/436570 (https://phabricator.wikimedia.org/T165337) [20:47:26] (03CR) 10BryanDavis: [C: 031] "Untested, but it looks like the right stuff :)" [puppet] - 10https://gerrit.wikimedia.org/r/436570 (https://phabricator.wikimedia.org/T165337) (owner: 10Andrew Bogott) [20:49:44] (03PS1) 10Dzahn: install_server: add phab1001 to DHCP,netboot [puppet] - 10https://gerrit.wikimedia.org/r/436678 (https://phabricator.wikimedia.org/T196019) [20:50:30] (03PS2) 10Dzahn: install_server: add phab1002 to DHCP,netboot [puppet] - 10https://gerrit.wikimedia.org/r/436678 (https://phabricator.wikimedia.org/T196019) [20:52:35] (03CR) 10Dzahn: [C: 032] install_server: add phab1002 to DHCP,netboot [puppet] - 10https://gerrit.wikimedia.org/r/436678 (https://phabricator.wikimedia.org/T196019) (owner: 10Dzahn) [20:57:44] !log mholloway-shell@tin Started deploy [tilerator/deploy@2a26f1e] (cleartables): (no justification provided) [20:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:51] !log mholloway-shell@tin Finished deploy [tilerator/deploy@2a26f1e] (cleartables): (no justification provided) (duration: 00m 07s) [20:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:54] !log dzahn@neodymium:~$ sudo wmf-auto-reimage-host --new phab1002.eqiad.wmnet (T196019) 
[21:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:59] T196019: setup/install phab1002(WMF4727) - https://phabricator.wikimedia.org/T196019 [21:04:13] (03PS4) 10Dzahn: ci: add VisualEditor and Wikibase to git cache [puppet] - 10https://gerrit.wikimedia.org/r/436512 (owner: 10Hashar) [21:12:29] (03CR) 10Dzahn: [C: 032] ci: add VisualEditor and Wikibase to git cache [puppet] - 10https://gerrit.wikimedia.org/r/436512 (owner: 10Hashar) [21:13:42] mutante: Danke :] [21:13:51] mutante: and it definitely worked for VisualEditor \o/ [21:20:53] PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 379.18 seconds [21:22:05] (03PS1) 10Dzahn: install_server: remove bast-test from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/436680 (https://phabricator.wikimedia.org/T186623) [21:22:35] hashar: de rien :) [21:23:36] (03PS2) 10Dzahn: install_server: remove bast-test from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/436680 (https://phabricator.wikimedia.org/T186623) [21:24:36] 10Operations, 10Analytics, 10Patch-For-Review: Puppet admin module should support adding system users to managed groups - https://phabricator.wikimedia.org/T174465#4247353 (10Ottomata) Another bump for my friends @akosiaris and @MoritzMuehlenhoff :) [21:24:37] (03CR) 10Dzahn: [C: 032] install_server: remove bast-test from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/436680 (https://phabricator.wikimedia.org/T186623) (owner: 10Dzahn) [21:26:58] (03PS1) 10Dzahn: remove bast-test.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/436681 (https://phabricator.wikimedia.org/T186623) [21:28:45] (03CR) 10Dzahn: [C: 032] "unused former test IP and host" [dns] - 10https://gerrit.wikimedia.org/r/436681 (https://phabricator.wikimedia.org/T186623) (owner: 10Dzahn) [21:29:17] (03CR) 10Alex Monk: "This thing is already doing an HTTP query against another host on the network so I would assume sorting <2000 items in a list is 
going to " [puppet] - 10https://gerrit.wikimedia.org/r/436624 (owner: 10Alex Monk) [21:41:03] (03CR) 10Alex Monk: "It looks like we had some ability to order by in I28bf175231ab025f367bb51bf83c41b29c83a2c6 but I65b1aeac097be7f24fb7f72695167b513e983302 r" [puppet] - 10https://gerrit.wikimedia.org/r/436624 (owner: 10Alex Monk) [21:43:35] (03CR) 10Alex Monk: "And in fact, this was previously ordered until that commit: I47d1f6492f02b38765cbd040acd0a5710bef69e6" [puppet] - 10https://gerrit.wikimedia.org/r/436624 (owner: 10Alex Monk) [21:51:56] (03PS1) 10Dzahn: fix IP address for phab1002, was wrong row [dns] - 10https://gerrit.wikimedia.org/r/436685 (https://phabricator.wikimedia.org/T196019) [21:55:28] !log temporarily reducing s4-codfw-master consistency to aliviate lag (binlog_sync, flush_log) [21:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:36] !log mholloway-shell@tin Started deploy [tilerator/deploy@2a26f1e] (cleartables): (no justification provided) [21:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:20] (03CR) 10Dzahn: [C: 032] fix IP address for phab1002, was wrong row [dns] - 10https://gerrit.wikimedia.org/r/436685 (https://phabricator.wikimedia.org/T196019) (owner: 10Dzahn) [22:11:22] !log mholloway-shell@tin Finished deploy [tilerator/deploy@2a26f1e] (cleartables): (no justification provided) (duration: 12m 46s) [22:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:42] 10Operations, 10Wikimedia-Mailing-lists: Create new editing-team mailing list - https://phabricator.wikimedia.org/T196120#4247475 (10kaldari) [22:38:00] (03CR) 10Bstorm: [C: 032] "Yeah, it seems pretty unlikely that little bit of processing would be an issue. I'll merge it. 
We can always go back if it turns out to " [puppet] - 10https://gerrit.wikimedia.org/r/436624 (owner: 10Alex Monk) [22:38:11] (03PS2) 10Bstorm: ssh known_hosts: sort resources by certname [puppet] - 10https://gerrit.wikimedia.org/r/436624 (owner: 10Alex Monk) [22:40:00] (03CR) 10Bstorm: [C: 031] "On second thought, I'll wait on that until people can take a look in the morning, just in case ;-)" [puppet] - 10https://gerrit.wikimedia.org/r/436624 (owner: 10Alex Monk) [22:48:56] !log pnorman@tin Started deploy [tilerator/deploy@2a26f1e] (cleartables): Redeploy to 2004 to try to reproduce error [22:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:47] (03CR) 10Krinkle: [C: 031] Drop the UnicodeConverter extension from production, part 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436331 (https://phabricator.wikimedia.org/T195941) (owner: 10Jforrester) [22:49:55] (03CR) 10Krinkle: [C: 031] Drop the UnicodeConverter extension from production, part 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436332 (https://phabricator.wikimedia.org/T195941) (owner: 10Jforrester) [22:50:01] (03CR) 10Krinkle: [C: 031] Drop the UnicodeConverter extension from production, part 3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436333 (https://phabricator.wikimedia.org/T195941) (owner: 10Jforrester) [22:50:19] (03CR) 10Krinkle: [C: 031] Drop the UnicodeConverter extension from production, part 4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436334 (https://phabricator.wikimedia.org/T195941) (owner: 10Jforrester) [22:51:10] !log pnorman@tin Finished deploy [tilerator/deploy@2a26f1e] (cleartables): Redeploy to 2004 to try to reproduce error (duration: 02m 14s) [22:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:26] (03PS5) 10Krinkle: webperf: Add navtiming and coal to scap::sources [puppet] - 10https://gerrit.wikimedia.org/r/436601 (https://phabricator.wikimedia.org/T195314) 
[22:59:48] !log pnorman@tin Started deploy [tilerator/deploy@709ca69] (cleartables): Redeploy to 2004 to try to reproduce error [22:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy Evening SWAT (Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180531T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:02:09] !log pnorman@tin Finished deploy [tilerator/deploy@709ca69] (cleartables): Redeploy to 2004 to try to reproduce error (duration: 02m 22s) [23:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:47] Operations, ops-eqiad, Patch-For-Review: setup/install phab1002(WMF4727) - https://phabricator.wikimedia.org/T196019#4247544 (Dzahn) in the installer i selected the very last step to install grub manually. next was: ``` [!!] Install the GRUB boot loader on a hard disk ├┐ │... [23:21:53] Operations, Wikimedia-Planet, Patch-For-Review, User-notice: planet.wikimedia.org: replace planet-venus software with rawdog - https://phabricator.wikimedia.org/T180498#4247546 (Liuxinyu970226) [23:27:33] Operations, ops-eqiad, Patch-For-Review: setup/install phab1002(WMF4727) - https://phabricator.wikimedia.org/T196019#4247549 (Paladox) sounds like it expected a ssd. 
[23:28:06] !log pnorman@tin Started deploy [tilerator/deploy@709ca69] (cleartables): Redeploy to 2004 to try to reproduce error [23:28:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:03] !log pnorman@tin Finished deploy [tilerator/deploy@709ca69] (cleartables): Redeploy to 2004 to try to reproduce error (duration: 06m 57s) [23:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:13] !log pnorman@tin Started deploy [tilerator/deploy@709ca69] (cleartables): Redeploy to 2004 to try to reproduce error [23:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:34] !log pnorman@tin Finished deploy [tilerator/deploy@709ca69] (cleartables): Redeploy to 2004 to try to reproduce error (duration: 02m 22s) [23:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:20] Operations, ops-eqiad: WMF4727 hardware issue - disks dont detect in installer - https://phabricator.wikimedia.org/T189804#4247560 (Dzahn) Resolved>Open reopening. got this same box assigned as a spare for something different and it has the same issue [23:42:25] Operations, ops-eqiad, Patch-For-Review: setup/install phab1002(WMF4727) - https://phabricator.wikimedia.org/T196019#4247566 (Dzahn) See T189804 and T190093. This is the same spare machine that had this same issue before when i got it for bastion host replacement in the past. I reopened one of those. 
[23:46:48] Operations, ops-eqiad: WMF4727 hardware issue - disks dont detect in installer - https://phabricator.wikimedia.org/T189804#4247573 (Dzahn) Issue is back with a new host name and a new install on the same hardware in 196019#4247544 [23:53:24] Operations, ops-eqiad: WMF4727 hardware issue - disks dont detect in installer - https://phabricator.wikimedia.org/T189804#4247577 (Dzahn) It's possible that the issue Rob described on this ticket isn't identical what i describe above, but mine is just like T190093 and we closed that as duplicate of thi... [23:59:25] !log pnorman@tin Started deploy [tilerator/deploy@709ca69] (cleartables): Redeploy to 2004 to try to reproduce error [23:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log