[00:06:09] (PS1) Bstorm: maintain_dbusers: Reverting to the old location to save git history [puppet] - https://gerrit.wikimedia.org/r/491189 (https://phabricator.wikimedia.org/T216373)
[00:08:18] (PS2) Bstorm: maintain_dbusers: Reverting to the old location to save git history [puppet] - https://gerrit.wikimedia.org/r/491189 (https://phabricator.wikimedia.org/T216373)
[00:11:36] (CR) Bstorm: "My whole idea here is that I don't want to be stuck working on copies of files without a history if we can avoid it (easy to do in ops/pup" [puppet] - https://gerrit.wikimedia.org/r/491189 (https://phabricator.wikimedia.org/T216373) (owner: Bstorm)
[01:17:26] Operations, Security, Surveys: Re-evaluate Limesurvey - https://phabricator.wikimedia.org/T109606 (Aklapper) a:egalvezwmf→None https://meta.wikimedia.org/wiki/Surveys implies that Qualtrics is currently used. Proposing to decline this task as I see noone driving a comparison (or questioning...
[01:19:14] (PS11) Mathew.onipe: Add wdqs data transfer cookbook [cookbooks] - https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401)
[01:19:50] (CR) Mathew.onipe: Add wdqs data transfer cookbook (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) (owner: Mathew.onipe)
[01:38:15] (PS1) Mathew.onipe: maps: migrate maps2003 to stretch [puppet] - https://gerrit.wikimedia.org/r/491191 (https://phabricator.wikimedia.org/T198622)
[05:52:22] !log Set dbstore1002 on read only to start the migration T210478 T215589
[05:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:52:26] T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478
[05:52:27] T215589: Migrate users to dbstore100[3-5] - https://phabricator.wikimedia.org/T215589
[05:55:23] !log Deploy schema change on s8 primary master (db1071) - T210713
[05:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:25] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713
[06:03:55] (PS1) Marostegui: dbstore1002: Set the host to read-only [puppet] - https://gerrit.wikimedia.org/r/491195 (https://phabricator.wikimedia.org/T210478)
[06:17:29] Operations, Cloud-VPS, netops, User-Marostegui, cloud-services-team (Kanban): toolsdb: firewalling changes for new setup (temporal mysql replication) - https://phabricator.wikimedia.org/T216353 (Bstorm) p:Low→High Drat! The priority on the last fixup of adding those three IP addresse...
[06:18:25] Operations, Cloud-VPS, netops, User-Marostegui, cloud-services-team (Kanban): toolsdb: firewalling changes for new setup (temporal mysql replication) - https://phabricator.wikimedia.org/T216353 (Bstorm)
[06:18:50] Operations, Cloud-VPS, netops, User-Marostegui, cloud-services-team (Kanban): toolsdb: firewalling changes for new setup (temporal mysql replication) - https://phabricator.wikimedia.org/T216353 (bd808) We need to be able to have labstore100[45] and labsdb1004 talk to port 3306 on clouddb1001....
[06:22:37] PROBLEM - HHVM jobrunner on mw1309 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.074 second response time
[06:23:49] RECOVERY - HHVM jobrunner on mw1309 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.081 second response time
[06:24:40] (PS1) Marostegui: db-eqiad.php: Depool db1119 [mediawiki-config] - https://gerrit.wikimedia.org/r/491196
[06:26:49] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1119 [mediawiki-config] - https://gerrit.wikimedia.org/r/491196 (owner: Marostegui)
[06:28:00] (Merged) jenkins-bot: db-eqiad.php: Depool db1119 [mediawiki-config] - https://gerrit.wikimedia.org/r/491196 (owner: Marostegui)
[06:28:31] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:29:23] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1119 for mysql upgrade (duration: 01m 01s)
[06:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:45] !log Stop MySQL on db1119 for mysql and kernel upgrade
[06:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:18] (CR) jenkins-bot: db-eqiad.php: Depool db1119 [mediawiki-config] - https://gerrit.wikimedia.org/r/491196 (owner: Marostegui)
[06:38:17] marostegui: o/ - all in progress?
[06:38:31] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational
[06:38:48] elukey: yep!
[06:38:59] nice!
thanks
[06:38:59] elukey: you've got a gerrit review :)
[06:39:01] (PS1) Marostegui: db-eqiad.php: Slowly repool db1119 [mediawiki-config] - https://gerrit.wikimedia.org/r/491197
[06:40:12] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1119 [mediawiki-config] - https://gerrit.wikimedia.org/r/491197 (owner: Marostegui)
[06:41:18] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1119 [mediawiki-config] - https://gerrit.wikimedia.org/r/491197 (owner: Marostegui)
[06:42:23] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1119 after mysql upgrade (duration: 00m 46s)
[06:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:46] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1119 [mediawiki-config] - https://gerrit.wikimedia.org/r/491197 (owner: Marostegui)
[06:49:05] (PS1) Marostegui: db-eqiad.php: Repool db1119 in API [mediawiki-config] - https://gerrit.wikimedia.org/r/491198
[06:49:31] !log Reboot db2085 to disable debug mode on kernel T216273
[06:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:34] T216273: New cronspam from db clusters - https://phabricator.wikimedia.org/T216273
[06:50:55] (CR) Marostegui: [C: +2] db-eqiad.php: Repool db1119 in API [mediawiki-config] - https://gerrit.wikimedia.org/r/491198 (owner: Marostegui)
[06:52:03] (Merged) jenkins-bot: db-eqiad.php: Repool db1119 in API [mediawiki-config] - https://gerrit.wikimedia.org/r/491198 (owner: Marostegui)
[06:53:10] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: repool db1119 into API service after mysql upgrade (duration: 00m 46s)
[06:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:49] (CR) jenkins-bot: db-eqiad.php: Repool db1119 in API [mediawiki-config] - https://gerrit.wikimedia.org/r/491198 (owner: Marostegui)
[06:56:57] (CR) Elukey: [C: +1]
dbstore1002: Set the host to read-only [puppet] - https://gerrit.wikimedia.org/r/491195 (https://phabricator.wikimedia.org/T210478) (owner: Marostegui)
[06:57:02] \o/
[06:57:08] \o/
[06:57:10] (CR) Marostegui: [C: +2] dbstore1002: Set the host to read-only [puppet] - https://gerrit.wikimedia.org/r/491195 (https://phabricator.wikimedia.org/T210478) (owner: Marostegui)
[06:59:00] Operations, ops-codfw, DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (Marostegui)
[06:59:36] Operations, ops-codfw, DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (Marostegui) I have rebooted db2085 without debug option on kernel as part of (T216273) and I have taken the opportunity to upgrade its kernel too.
[07:01:51] Operations: New cronspam from db clusters - https://phabricator.wikimedia.org/T216273 (Marostegui) db2085 has been rebooted - let's see if that stops the amount of emails.
[07:03:12] (PS1) Marostegui: db-eqiad.php: Fully repool db1119 [mediawiki-config] - https://gerrit.wikimedia.org/r/491199
[07:05:17] (CR) Marostegui: [C: +2] db-eqiad.php: Fully repool db1119 [mediawiki-config] - https://gerrit.wikimedia.org/r/491199 (owner: Marostegui)
[07:06:31] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1119 [mediawiki-config] - https://gerrit.wikimedia.org/r/491199 (owner: Marostegui)
[07:06:43] (CR) jenkins-bot: db-eqiad.php: Fully repool db1119 [mediawiki-config] - https://gerrit.wikimedia.org/r/491199 (owner: Marostegui)
[07:07:37] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1119 after mysql upgrade (duration: 00m 46s)
[07:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:57] (PS2) Muehlenhoff: imagemagick: Unconditionally use /etc/ImageMagick-6/ [puppet] - https://gerrit.wikimedia.org/r/487888
[07:58:50] (CR) Muehlenhoff: [C: +2] imagemagick: Unconditionally use /etc/ImageMagick-6/ [puppet] - https://gerrit.wikimedia.org/r/487888 (owner: Muehlenhoff)
[07:59:27] (PS1) Elukey: camus: fix webrequest testing Kafka topic [puppet] - https://gerrit.wikimedia.org/r/491212
[08:00:30] (CR) Elukey: [C: +2] camus: fix webrequest testing Kafka topic [puppet] - https://gerrit.wikimedia.org/r/491212 (owner: Elukey)
[08:00:36] (PS2) Elukey: camus: fix webrequest testing Kafka topic [puppet] - https://gerrit.wikimedia.org/r/491212
[08:00:45] (CR) Elukey: [V: +2 C: +2] camus: fix webrequest testing Kafka topic [puppet] - https://gerrit.wikimedia.org/r/491212 (owner: Elukey)
[08:09:52] Operations, Scap: Remove trusty-specific hacks from logstash_checker.py - https://phabricator.wikimedia.org/T216380 (MoritzMuehlenhoff)
[08:23:45] !log Deploy schema change on s1 codfw master (db2048), lag will be generated on s1 codfw - T210713
[08:23:48] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:48] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713
[08:27:21] !log Drop ep_* tables from s5 (srwiki) - T174802
[08:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:24] T174802: Archive and drop education program (ep_*) tables on all wikis - https://phabricator.wikimedia.org/T174802
[08:28:50] Operations: Integrate Stretch 9.8 point update - https://phabricator.wikimedia.org/T216384 (MoritzMuehlenhoff)
[08:35:41] (PS1) Elukey: camus: fix webrequest_test Kafka topic whitelist [puppet] - https://gerrit.wikimedia.org/r/491214
[08:36:41] (CR) Elukey: [C: +2] camus: fix webrequest_test Kafka topic whitelist [puppet] - https://gerrit.wikimedia.org/r/491214 (owner: Elukey)
[09:08:13] !log Deploy schema change on dbstore1003:3311 and dbstore1001:3311 - T210713
[09:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:16] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713
[09:13:04] (PS1) Marostegui: db-eqiad.php: Depool db1099:3311 [mediawiki-config] - https://gerrit.wikimedia.org/r/491218 (https://phabricator.wikimedia.org/T210713)
[09:14:20] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1099:3311 [mediawiki-config] - https://gerrit.wikimedia.org/r/491218 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[09:15:32] (Merged) jenkins-bot: db-eqiad.php: Depool db1099:3311 [mediawiki-config] - https://gerrit.wikimedia.org/r/491218 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[09:16:39] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1099:3311 T210713 (duration: 00m 48s)
[09:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:42] T210713: Drop change_tag.ct_tag column in production -
https://phabricator.wikimedia.org/T210713
[09:16:43] !log Deploy schema change on db1099:3311 - T210713
[09:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:47] (PS1) Elukey: Reserve IP for kerberos1001.eqiad.wment (Ganeti VM) [dns] - https://gerrit.wikimedia.org/r/491219 (https://phabricator.wikimedia.org/T216238)
[09:20:34] (CR) jenkins-bot: db-eqiad.php: Depool db1099:3311 [mediawiki-config] - https://gerrit.wikimedia.org/r/491218 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[09:21:46] Operations: Integrate Stretch 9.8 point update - https://phabricator.wikimedia.org/T216384 (MoritzMuehlenhoff)
[09:21:53] Operations: Integrate Stretch 9.8 point update - https://phabricator.wikimedia.org/T216384 (MoritzMuehlenhoff) These packages are not used in our production infrastructure: - arc - astroml-addons - chkrootkit - compactheader - courier - debian-edu-config - debian-installer - debian-installer-netboot-images...
[09:24:29] (PS1) Elukey: WIP: Introduce kerberos1001 [puppet] - https://gerrit.wikimedia.org/r/491222 (https://phabricator.wikimedia.org/T216238)
[09:25:13] (CR) jerkins-bot: [V: -1] WIP: Introduce kerberos1001 [puppet] - https://gerrit.wikimedia.org/r/491222 (https://phabricator.wikimedia.org/T216238) (owner: Elukey)
[09:25:35] heh
[09:28:42] !log Drop ep_* from s6 (ruwiki) - T174802
[09:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:45] T174802: Archive and drop education program (ep_*) tables on all wikis - https://phabricator.wikimedia.org/T174802
[09:30:15] (Abandoned) Elukey: WIP: Introduce kerberos1001 [puppet] - https://gerrit.wikimedia.org/r/491222 (https://phabricator.wikimedia.org/T216238) (owner: Elukey)
[09:33:10] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - https://gerrit.wikimedia.org/r/491224
[09:35:59] (CR) Volans: [C: -1] "I think it doesn't do yet what you'd like ;)" (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/490866 (https://phabricator.wikimedia.org/T207920) (owner: Gehel)
[09:42:47] Operations, vm-requests, Patch-For-Review: eqiad: (1) Ganeti VM for testing Kerberos in Production - https://phabricator.wikimedia.org/T216238 (elukey) So after reading https://wikitech.wikimedia.org/wiki/Ganeti#Create_a_VM this is what I'd do: 1) Review/Merge https://gerrit.wikimedia.org/r/491219 t...
[09:44:33] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - https://gerrit.wikimedia.org/r/491224 (owner: Marostegui)
[09:45:42] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - https://gerrit.wikimedia.org/r/491224 (owner: Marostegui)
[09:46:47] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1099:3311 T210713 (duration: 00m 46s)
[09:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:58] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713
[09:47:16] (PS1) Marostegui: db-eqiad.php: Depool db1105:3311 [mediawiki-config] - https://gerrit.wikimedia.org/r/491225 (https://phabricator.wikimedia.org/T210713)
[09:48:34] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1105:3311 [mediawiki-config] - https://gerrit.wikimedia.org/r/491225 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[09:49:39] (Merged) jenkins-bot: db-eqiad.php: Depool db1105:3311 [mediawiki-config] - https://gerrit.wikimedia.org/r/491225 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[09:50:39] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1105:3311 T210713 (duration: 00m 46s)
[09:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:45] !log Deploy schema change on db1105:3311 T210713
[09:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:58] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - https://gerrit.wikimedia.org/r/491224 (owner: Marostegui)
[09:53:00] (CR) jenkins-bot: db-eqiad.php: Depool db1105:3311 [mediawiki-config] - https://gerrit.wikimedia.org/r/491225 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[10:06:01] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1105:3311"
[mediawiki-config] - https://gerrit.wikimedia.org/r/491227
[10:08:07] Operations: New cronspam from db clusters - https://phabricator.wikimedia.org/T216273 (MoritzMuehlenhoff) >>! In T216273#4960706, @Marostegui wrote: > db2085 has been rebooted - let's see if that stops the amount of emails. I re-ran the auto restarts manually on db2085 and that didn't lead to any new Cron m...
[10:08:48] Operations: New cronspam from db clusters - https://phabricator.wikimedia.org/T216273 (Marostegui) a:Marostegui I will take care of db1106 as I need to depool it anyways today or tomorrow.
[10:11:57] (PS1) Gilles: Launch performance perception survey on eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/491229 (https://phabricator.wikimedia.org/T187299)
[10:21:11] Operations, Analytics, hardware-requests: GPU upgrade for stat1005 - https://phabricator.wikimedia.org/T216226 (elukey) Thanks all for all the detailed info! One thought: I found this interesting use case https://www.amd.com/en/case-studies/school-42 among the case studies in the AMD website, that s...
[10:22:26] (PS3) Gehel: elasticsearch: retry on TransportError while waiting for node to be up [software/spicerack] - https://gerrit.wikimedia.org/r/490866 (https://phabricator.wikimedia.org/T207920)
[10:24:09] (PS1) DCausse: [cirrus] reduce master timeout to 30s [mediawiki-config] - https://gerrit.wikimedia.org/r/491231 (https://phabricator.wikimedia.org/T215969)
[10:26:13] (CR) Gehel: elasticsearch: retry on TransportError while waiting for node to be up (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/490866 (https://phabricator.wikimedia.org/T207920) (owner: Gehel)
[10:27:20] (PS1) Elukey: Set stat1005 OS install settings back to Stretch [puppet] - https://gerrit.wikimedia.org/r/491232
[10:28:14] (CR) jerkins-bot: [V: -1] elasticsearch: retry on TransportError while waiting for node to be up [software/spicerack] - https://gerrit.wikimedia.org/r/490866 (https://phabricator.wikimedia.org/T207920) (owner: Gehel)
[10:28:16] (CR) Elukey: [C: +2] Set stat1005 OS install settings back to Stretch [puppet] - https://gerrit.wikimedia.org/r/491232 (owner: Elukey)
[10:31:26] (PS4) Gehel: elasticsearch: retry on TransportError while waiting for node to be up [software/spicerack] - https://gerrit.wikimedia.org/r/490866 (https://phabricator.wikimedia.org/T207920)
[10:34:10] (CR) Volans: [C: +1] "LGTM" (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/490866 (https://phabricator.wikimedia.org/T207920) (owner: Gehel)
[10:34:54] (CR) Gehel: [C: +2] elasticsearch: retry on TransportError while waiting for node to be up [software/spicerack] - https://gerrit.wikimedia.org/r/490866 (https://phabricator.wikimedia.org/T207920) (owner: Gehel)
[10:35:24] (CR) Effie Mouzeli: [C: +2] "I will do the changes you both mentioned in the next patch, thank you!"
[puppet] - https://gerrit.wikimedia.org/r/490610 (https://phabricator.wikimedia.org/T214597) (owner: Effie Mouzeli)
[10:40:04] !log Drop tables ep_* from s2 (cswiki nlwiki ptwiki svwiki) T174802
[10:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:08] T174802: Archive and drop education program (ep_*) tables on all wikis - https://phabricator.wikimedia.org/T174802
[10:40:33] (Merged) jenkins-bot: elasticsearch: retry on TransportError while waiting for node to be up [software/spicerack] - https://gerrit.wikimedia.org/r/490866 (https://phabricator.wikimedia.org/T207920) (owner: Gehel)
[10:41:33] (CR) jenkins-bot: elasticsearch: retry on TransportError while waiting for node to be up [software/spicerack] - https://gerrit.wikimedia.org/r/490866 (https://phabricator.wikimedia.org/T207920) (owner: Gehel)
[10:42:40] (PS2) Effie Mouzeli: Upgrade thumbor2002 to stretch [puppet] - https://gerrit.wikimedia.org/r/490610 (https://phabricator.wikimedia.org/T214597)
[10:52:09] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1105:3311" [mediawiki-config] - https://gerrit.wikimedia.org/r/491227 (owner: Marostegui)
[10:52:49] Operations, ops-codfw, serviceops, User-jijiki: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214813 (jijiki) Server will be re-imaged to stretch as part of upgrading Thumbor servers to stretch - T214597
[10:53:17] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1105:3311" [mediawiki-config] - https://gerrit.wikimedia.org/r/491227 (owner: Marostegui)
[10:53:30] Operations, Thumbor, serviceops, Patch-For-Review, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (jijiki)
[10:53:33] Operations, ops-codfw, serviceops, User-jijiki: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214813 (jijiki)
[10:53:36] Operations, Thumbor, serviceops, Patch-For-Review,
and 2 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (jijiki)
[10:53:43] Operations, ops-codfw, serviceops, User-jijiki: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214813 (jijiki) Open→Resolved
[10:53:46] Operations, Thumbor, serviceops, Patch-For-Review, and 2 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (jijiki)
[10:54:01] !log Reimaging thumbor2002 to stretch - T214597
[10:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:05] T214597: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597
[10:54:11] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1105:3311 T210713 (duration: 00m 46s)
[10:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:13] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713
[10:58:33] (PS1) Ladsgroup: Change Special:ItemDisambiguation from blank special page to disabled page [mediawiki-config] - https://gerrit.wikimedia.org/r/491237 (https://phabricator.wikimedia.org/T216397)
[10:59:58] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1105:3311" [mediawiki-config] - https://gerrit.wikimedia.org/r/491227 (owner: Marostegui)
[11:03:27] Operations: Integrate Stretch 9.8 point update - https://phabricator.wikimedia.org/T216384 (MoritzMuehlenhoff)
[11:04:36] (PS7) Elukey: service::node: add the 'use_nodejs10' parameter [puppet] - https://gerrit.wikimedia.org/r/477475 (https://phabricator.wikimedia.org/T210704)
[11:07:49] (PS8) Elukey: service::node: add the 'use_nodejs10' parameter [puppet] - https://gerrit.wikimedia.org/r/477475 (https://phabricator.wikimedia.org/T210704)
[11:11:08] (PS9) Elukey: service::node: add the 'use_nodejs10' parameter [puppet] - https://gerrit.wikimedia.org/r/477475
(https://phabricator.wikimedia.org/T210704)
[11:11:09] Operations: Integrate Stretch 9.8 point update - https://phabricator.wikimedia.org/T216384 (MoritzMuehlenhoff)
[11:11:57] !log installing c3p0 security updates
[11:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:30] (CR) Elukey: "Pcc of the rebased version https://puppet-compiler.wmflabs.org/compiler1001/14714/" [puppet] - https://gerrit.wikimedia.org/r/477475 (https://phabricator.wikimedia.org/T210704) (owner: Elukey)
[11:17:03] Operations: Integrate Stretch 9.8 point update - https://phabricator.wikimedia.org/T216384 (MoritzMuehlenhoff)
[11:27:31] (PS1) Muehlenhoff: Add library hint for uriparser [puppet] - https://gerrit.wikimedia.org/r/491239
[11:27:52] (PS2) Muehlenhoff: Add library hint for uriparser [puppet] - https://gerrit.wikimedia.org/r/491239
[11:30:52] Operations, Thumbor, serviceops, Patch-For-Review, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin2001.codfw.wmnet for hosts: ` thumbor2002.codfw.wmnet ` The log can be found in...
[11:32:35] (CR) Muehlenhoff: [C: +2] Add library hint for uriparser [puppet] - https://gerrit.wikimedia.org/r/491239 (owner: Muehlenhoff)
[11:36:34] !log installing uriparser security updates
[11:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:56] Operations, Thumbor, serviceops, Patch-For-Review, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thumbor2002.codfw.wmnet'] ` Of which those **FAILED**: ` ['thumbor2002.codfw.wmnet'] `
[11:39:19] Operations, Thumbor, serviceops, Patch-For-Review, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin2001.codfw.wmnet for hosts: ` thumbor2002.codfw.wmnet ` The log can be found in...
[11:39:23] Operations, Thumbor, serviceops, Patch-For-Review, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thumbor2002.codfw.wmnet'] ` Of which those **FAILED**: ` ['thumbor2002.codfw.wmnet'] `
[11:40:13] Operations: Integrate Stretch 9.8 point update - https://phabricator.wikimedia.org/T216384 (MoritzMuehlenhoff)
[11:41:20] Operations, Thumbor, serviceops, Patch-For-Review, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin2001.codfw.wmnet for hosts: ` thumbor2002.codfw.wmnet ` The log can be found in...
[11:41:23] Operations, Thumbor, serviceops, Patch-For-Review, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thumbor2002.codfw.wmnet'] ` Of which those **FAILED**: ` ['thumbor2002.codfw.wmnet'] `
[11:44:11] Operations: Integrate Stretch 9.5 point release - https://phabricator.wikimedia.org/T199670 (MoritzMuehlenhoff) Open→Resolved These updates have been fully deployed: ` ca-certificates postgresql-common ganeti postgresql-9.6 (mostly rolled out, remaining server superseded by 9.8 update) `
[11:44:21] Operations, Thumbor, serviceops, Patch-For-Review, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin2001.codfw.wmnet for hosts: ` thumbor2002.codfw.wmnet ` The log can be found in...
[11:44:43] Operations, Scap, Release-Engineering-Team (Kanban): Remove trusty-specific hacks from logstash_checker.py - https://phabricator.wikimedia.org/T216380 (zeljkofilipin)
[11:52:36] (PS2) Giuseppe Lavagetto: Add etcd3 driver [software/conftool] - https://gerrit.wikimedia.org/r/359919
[11:53:12] !log installing hdparm bugfix update from stretch point release
[11:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:55:37] (CR) jerkins-bot: [V: -1] Add etcd3 driver [software/conftool] - https://gerrit.wikimedia.org/r/359919 (owner: Giuseppe Lavagetto)
[11:57:37] Operations, Traffic: Content purges are unreliable - https://phabricator.wikimedia.org/T133821 (Bawolff)
[12:04:57] Operations, Cloud-VPS, netops, User-Marostegui, cloud-services-team (Kanban): toolsdb: firewalling changes for new setup (temporal mysql replication) - https://phabricator.wikimedia.org/T216353 (aborrero) Verifying that CR is indeed blocking the connections: ` aborrero@labstore1004:~$ telnet...
[12:06:09] (CR) Muehlenhoff: "Bikeshedding! IMHO kerberos isn't such a great name, there'll be several hosts related to Kerberos at some point, how about kdc1001? (One " [puppet] - https://gerrit.wikimedia.org/r/491222 (https://phabricator.wikimedia.org/T216238) (owner: Elukey)
[12:07:51] moritzm: sure, I have no opposition :) I thought kerberos1001 since it was a meaningful name for whoever is not following the testing/migration (rather than kdc)
[12:09:52] my point is that we don't know for 100% how the final servers will look like and if we pick kerberos* that felt like a deja vu from naming the Hadoop servers analytics* :-)
[12:10:31] ah sure this VM will be nuked at some point
[12:10:32] but feel free to pick any name, just a thought from the peanut gallery :-)
[12:10:45] I am planning to keep it around only for testing
[12:11:28] ack, any name works then :-)
[12:11:52] servermcserverface1001
[12:15:01] Operations, Thumbor, serviceops, Patch-For-Review, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thumbor2002.codfw.wmnet'] ` Of which those **FAILED**: ` ['thumbor2002.codfw.wmnet'] `
[12:15:09] thanks bawolff - that was fast :)
[12:15:26] Operations, Thumbor, serviceops, Patch-For-Review, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin2001.codfw.wmnet for hosts: ` thumbor2002.codfw.wmnet ` The log can be found in...
[12:16:00] Its the standard CR response time - either it happens in 5 minutes or it takes 5 years
[12:16:29] lol ain't that true :)
[12:17:21] hauskatze: If you want, I could tell you how to add the username to it (or i could just do it if you prefer)
[12:17:28] (PS2) Giuseppe Lavagetto: tlsproxy::instance: move under profile namespace [puppet] - https://gerrit.wikimedia.org/r/488509
[12:18:33] bawolff: I think it'd be easier if you did it yourself if you want to.
[12:18:42] I don't want to take credit of other's work
[12:18:48] ok
[12:19:16] probably we should do it for centralauth-error-locked too
[12:19:32] note that we have login-error and -error
[12:19:43] feel free to add a task :)
[12:20:20] although I'm in the middle of splitting API/non-API messages from extension-CentralAuth and that'd cause a merge conflict, heh
[12:20:37] https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CentralAuth/+/489481/
[12:22:55] Guess i should probably test this somehow. worst part of touching central auth, is everytime i have to re-set it up in order to test it
[12:23:18] (PS3) Giuseppe Lavagetto: tlsproxy::instance: move under profile namespace [puppet] - https://gerrit.wikimedia.org/r/488509
[12:24:19] it probably won't conflict with that patch
[12:27:48] Operations, Cloud-VPS, netops, User-Marostegui, cloud-services-team (Kanban): toolsdb: firewalling changes for new setup (temporal mysql replication) - https://phabricator.wikimedia.org/T216353 (ayounsi) `lang=diff [edit firewall family inet filter cloud-in4] term labsdb { ... } +...
[12:28:31] !log update clouddb_return term from cloud-in4 on cr1/2-eqiad - T216353
[12:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:35] T216353: toolsdb: firewalling changes for new setup (temporal mysql replication) - https://phabricator.wikimedia.org/T216353
[12:29:52] lol, its even documented as accepting a username as $1 in qqq, eventhough it doesn't
[12:30:58] Operations, Cloud-VPS, netops, User-Marostegui, cloud-services-team (Kanban): toolsdb: firewalling changes for new setup (temporal mysql replication) - https://phabricator.wikimedia.org/T216353 (aborrero) It works! ` aborrero@labstore1004:~ $ telnet 172.16.7.153 3306 Trying 172.16.7.153... C...
[12:31:13] Operations, Cloud-VPS, netops, User-Marostegui, cloud-services-team (Kanban): toolsdb: firewalling changes for new setup (temporal mysql replication) - https://phabricator.wikimedia.org/T216353 (aborrero) Open→Resolved
[12:31:23] !log installing upgrading stat1005 to buster
[12:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:22] !log installing brltty bugfix update from stretch point release
[12:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:45] Operations, Thumbor, serviceops, Patch-For-Review, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (Gilles) Let me know when the host is ready for testing
[12:48:13] (CR) Giuseppe Lavagetto: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/14718/ shows the change is a noop on all nodes that include tlsproxy::instance right now."
[puppet] - 10https://gerrit.wikimedia.org/r/488509 (owner: 10Giuseppe Lavagetto) [12:48:29] <_joe_> uhm actually [12:48:36] 10Operations: Integrate Stretch 9.6 point update - https://phabricator.wikimedia.org/T209260 (10MoritzMuehlenhoff) These updates have been fully deployed: ` brltty hdparm libseccomp systemd unbound ` [12:51:31] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, 10User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (10jijiki) [12:51:34] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214813 (10jijiki) 05Resolved→03Open @papaul I am unable to reimage the server because PXE boot is failing. Server says: ` Broadcom UNDI PXE-2.1 v16.4.3 Copyright (C) 2000-2014 Bro... [12:51:36] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (10jijiki) [12:53:25] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, 10User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (10jijiki) @Gilles It looks like we have some issues with thumbor2002, we are investigating if we can continue the upgrade with other host. 
[12:55:39] (03PS1) 10Joal: Update sqoop launchers used by timers [puppet] - 10https://gerrit.wikimedia.org/r/491246 [12:55:53] elukey: --^ For when you're back :) [13:13:16] (03PS1) 10Muehlenhoff: Remove access for dartar [puppet] - 10https://gerrit.wikimedia.org/r/491248 [13:15:44] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, 10User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thumbor2002.codfw.wmnet'] ` Of which those **FAILED**: ` ['thumbor2002.codfw.wmnet'] ` [13:16:16] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for dartar [puppet] - 10https://gerrit.wikimedia.org/r/491248 (owner: 10Muehlenhoff) [13:23:11] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] maintain_dbusers: Reverting to the old location to save git history [puppet] - 10https://gerrit.wikimedia.org/r/491189 (https://phabricator.wikimedia.org/T216373) (owner: 10Bstorm) [13:25:27] (03Abandoned) 10Giuseppe Lavagetto: profile::services_proxy: require nginx_bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/488400 (owner: 10Giuseppe Lavagetto) [13:25:48] !log Depooling thumbor1004 to check if the rest of our hosts can handle the load without it - T214597 [13:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:52] T214597: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 [13:27:12] 10Operations, 10vm-requests, 10Patch-For-Review: eqiad: (1) Ganeti VM for testing Kerberos in Production - https://phabricator.wikimedia.org/T216238 (10fsero) p:05Triage→03Normal [13:39:55] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] transfer.py: Add the ability to transfer from a new mariabackup [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/486264 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [13:46:07] (03PS1) 10Jcrespo: mariadb: Modify dump_section to allow different types of dump [software/wmfmariadbpy] - 
10https://gerrit.wikimedia.org/r/491251 (https://phabricator.wikimedia.org/T210292) [13:46:34] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Modify dump_section to allow different types of dump [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/491251 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [13:48:49] 10Operations, 10Prod-Kubernetes, 10Documentation, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Next): Update Blubber documentation - https://phabricator.wikimedia.org/T213198 (10LarsWirzenius) https://wikitech.wikimedia.org/wiki/Blubber has been rewritten (by @thcipriani), is there anything... [13:55:24] (03PS1) 10Gehel: elasticsearch: add methods to upgrade elasticsearch and plugins [software/spicerack] - 10https://gerrit.wikimedia.org/r/491254 [13:59:04] (03PS1) 10Gehel: elasticsearch: add cookbook for rolling upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/491255 [13:59:20] !log Drop ep_* tables from s7 - T174802 [13:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:22] T174802: Archive and drop education program (ep_*) tables on all wikis - https://phabricator.wikimedia.org/T174802 [13:59:36] (03PS2) 10Gehel: elasticsearch: add cookbook for rolling upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/491255 (https://phabricator.wikimedia.org/T202885) [13:59:53] (03PS2) 10Gehel: elasticsearch: add methods to upgrade elasticsearch and plugins [software/spicerack] - 10https://gerrit.wikimedia.org/r/491254 (https://phabricator.wikimedia.org/T202885) [14:01:18] (03PS1) 10Jcrespo: mariadb: Update mariadb logical path location [puppet] - 10https://gerrit.wikimedia.org/r/491256 (https://phabricator.wikimedia.org/T210292) [14:01:52] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: add cookbook for rolling upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/491255 (https://phabricator.wikimedia.org/T202885) (owner: 10Gehel) [14:02:18] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Update 
mariadb logical path location [puppet] - 10https://gerrit.wikimedia.org/r/491256 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [14:04:31] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Minor comment, rest LGTM" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/491219 (https://phabricator.wikimedia.org/T216238) (owner: 10Elukey) [14:05:26] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: add methods to upgrade elasticsearch and plugins [software/spicerack] - 10https://gerrit.wikimedia.org/r/491254 (https://phabricator.wikimedia.org/T202885) (owner: 10Gehel) [14:13:26] (03PS3) 10Gehel: elasticsearch: add methods to upgrade elasticsearch and plugins [software/spicerack] - 10https://gerrit.wikimedia.org/r/491254 (https://phabricator.wikimedia.org/T202885) [14:14:44] (03CR) 10Volans: [C: 04-1] "See inline, not sure is the right place, but if this is a blocker I'm keen to accept it as a temporary solution (as long as it's temporary" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/491254 (https://phabricator.wikimedia.org/T202885) (owner: 10Gehel) [14:14:52] (03PS3) 10Gehel: elasticsearch: add cookbook for rolling upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/491255 (https://phabricator.wikimedia.org/T202885) [14:16:46] (03PS2) 10Joal: Update sqoop launchers used by timers [puppet] - 10https://gerrit.wikimedia.org/r/491246 (https://phabricator.wikimedia.org/T205940) [14:17:05] (03PS2) 10Jcrespo: mariadb: Update mariadb logical path location [puppet] - 10https://gerrit.wikimedia.org/r/491256 (https://phabricator.wikimedia.org/T210292) [14:17:10] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: add cookbook for rolling upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/491255 (https://phabricator.wikimedia.org/T202885) (owner: 10Gehel) [14:18:02] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Update mariadb logical path location [puppet] - 10https://gerrit.wikimedia.org/r/491256 (https://phabricator.wikimedia.org/T210292) (owner: 
10Jcrespo) [14:20:37] (03PS4) 10Gehel: elasticsearch: add cookbook for rolling upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/491255 (https://phabricator.wikimedia.org/T202885) [14:22:52] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: add cookbook for rolling upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/491255 (https://phabricator.wikimedia.org/T202885) (owner: 10Gehel) [14:22:55] (03PS3) 10Jcrespo: mariadb: Update mariadb logical path location [puppet] - 10https://gerrit.wikimedia.org/r/491256 (https://phabricator.wikimedia.org/T210292) [14:28:41] (03CR) 10Marostegui: [C: 03+1] mariadb: Update mariadb logical path location [puppet] - 10https://gerrit.wikimedia.org/r/491256 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [14:29:25] !log rebooting mw2167 for kernel tests [14:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:47] (03CR) 10Alexandros Kosiaris: [C: 03+1] service::node: add the 'use_nodejs10' parameter [puppet] - 10https://gerrit.wikimedia.org/r/477475 (https://phabricator.wikimedia.org/T210704) (owner: 10Elukey) [14:32:32] (03PS2) 10Elukey: Allocate IP for kerberos1001.eqiad.wment (Ganeti VM) [dns] - 10https://gerrit.wikimedia.org/r/491219 (https://phabricator.wikimedia.org/T216238) [14:32:35] (03PS4) 10Gehel: elasticsearch: add methods to upgrade elasticsearch and plugins [software/spicerack] - 10https://gerrit.wikimedia.org/r/491254 (https://phabricator.wikimedia.org/T202885) [14:32:40] (03CR) 10Elukey: ">" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/491219 (https://phabricator.wikimedia.org/T216238) (owner: 10Elukey) [14:35:08] (03PS2) 10Jcrespo: mariadb: Modify dump_section to allow different types of dump [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/491251 (https://phabricator.wikimedia.org/T210292) [14:35:42] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] mariadb: Modify dump_section to allow different types of dump [software/wmfmariadbpy] - 
10https://gerrit.wikimedia.org/r/491251 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [14:38:29] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: add methods to upgrade elasticsearch and plugins [software/spicerack] - 10https://gerrit.wikimedia.org/r/491254 (https://phabricator.wikimedia.org/T202885) (owner: 10Gehel) [14:38:45] (03PS9) 10Jcrespo: mariadb: Modify dump_section to allow different types of dump [puppet] - 10https://gerrit.wikimedia.org/r/475471 (https://phabricator.wikimedia.org/T210292) [14:43:34] (03CR) 10Jcrespo: [C: 03+2] mariadb: Modify dump_section to allow different types of dump [puppet] - 10https://gerrit.wikimedia.org/r/475471 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [14:44:13] 10Operations: Integrate Stretch 9.8 point update - https://phabricator.wikimedia.org/T216384 (10MoritzMuehlenhoff) [14:52:01] (03PS5) 10Gehel: elasticsearch: add cookbook for rolling upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/491255 (https://phabricator.wikimedia.org/T202885) [14:53:16] (03CR) 10Gehel: elasticsearch: add methods to upgrade elasticsearch and plugins (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/491254 (https://phabricator.wikimedia.org/T202885) (owner: 10Gehel) [14:53:28] (03CR) 10Jcrespo: [C: 03+2] mariadb: Update mariadb logical path location [puppet] - 10https://gerrit.wikimedia.org/r/491256 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [14:53:36] (03PS4) 10Jcrespo: mariadb: Update mariadb logical path location [puppet] - 10https://gerrit.wikimedia.org/r/491256 (https://phabricator.wikimedia.org/T210292) [14:55:33] (03CR) 10Mathew.onipe: "some few comments. Will do a more thorough pass in some minutes." 
(033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/491255 (https://phabricator.wikimedia.org/T202885) (owner: 10Gehel) [14:58:54] (03PS6) 10Gehel: elasticsearch: add cookbook for rolling upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/491255 (https://phabricator.wikimedia.org/T202885) [15:01:06] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: add cookbook for rolling upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/491255 (https://phabricator.wikimedia.org/T202885) (owner: 10Gehel) [15:01:38] (03CR) 10Volans: "Looks ok, some nitpick inline." (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/491255 (https://phabricator.wikimedia.org/T202885) (owner: 10Gehel) [15:03:27] PROBLEM - Host stat1005 is DOWN: PING CRITICAL - Packet loss = 100% [15:03:49] RECOVERY - Host stat1005 is UP: PING OK - Packet loss = 0%, RTA = 36.21 ms [15:05:38] (03PS7) 10Gehel: elasticsearch: add cookbook for rolling upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/491255 (https://phabricator.wikimedia.org/T202885) [15:11:26] stat1005 is me, going to add a permanent downtime via puppet [15:17:33] (03PS1) 10Jcrespo: mariadb: Fix dependency typo on backup directory [puppet] - 10https://gerrit.wikimedia.org/r/491262 [15:18:39] (03CR) 10Jcrespo: [C: 03+2] mariadb: Fix dependency typo on backup directory [puppet] - 10https://gerrit.wikimedia.org/r/491262 (owner: 10Jcrespo) [15:21:05] !log move logical backups to subdirectory T210292 [15:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:09] T210292: Implement a proof of concept of a snapshot cycle automation for a mediawiki section database - https://phabricator.wikimedia.org/T210292 [15:22:31] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843 (10elukey) Thanks to Moritz we have buster back on stat1005. 
I did the following: * Added `radeon.cik_support=0 amdgpu.cik_support=1` to grub.cfg... [15:23:53] (03CR) 10Elukey: [C: 03+2] Allocate IP for kerberos1001.eqiad.wment (Ganeti VM) [dns] - 10https://gerrit.wikimedia.org/r/491219 (https://phabricator.wikimedia.org/T216238) (owner: 10Elukey) [15:28:50] (03CR) 10Gehel: [C: 04-1] "minor comments inline" (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) (owner: 10Mathew.onipe) [15:30:16] 10Operations, 10vm-requests, 10Patch-For-Review: eqiad: (1) Ganeti VM for testing Kerberos in Production - https://phabricator.wikimedia.org/T216238 (10elukey) I had a chat with Moritz about the naming, since kerberos1001 seems a very generic and probably misleading name. This VM will only be used for testin... [15:32:50] (03PS3) 10Bstorm: toolforge: Use a really old version of kubectl for the current k8s [puppet] - 10https://gerrit.wikimedia.org/r/489291 (https://phabricator.wikimedia.org/T215586) [15:34:49] (03PS1) 10Elukey: Disable notifications for stat1005 while testing [puppet] - 10https://gerrit.wikimedia.org/r/491263 (https://phabricator.wikimedia.org/T148843) [15:35:39] (03CR) 10Elukey: [C: 03+2] Disable notifications for stat1005 while testing [puppet] - 10https://gerrit.wikimedia.org/r/491263 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey) [15:35:51] (03CR) 10Bstorm: [C: 03+2] toolforge: Use a really old version of kubectl for the current k8s [puppet] - 10https://gerrit.wikimedia.org/r/489291 (https://phabricator.wikimedia.org/T215586) (owner: 10Bstorm) [15:36:03] (03PS4) 10Bstorm: toolforge: Use a really old version of kubectl for the current k8s [puppet] - 10https://gerrit.wikimedia.org/r/489291 (https://phabricator.wikimedia.org/T215586) [15:37:22] (03CR) 10Fsero: "Thanks for the review!" 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/490073 (https://phabricator.wikimedia.org/T214289) (owner: 10Fsero) [15:41:04] !log performing es2 & es3 backups into es2002 [15:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:26] (03PS10) 10Elukey: service::node: add the 'use_nodejs10' parameter [puppet] - 10https://gerrit.wikimedia.org/r/477475 (https://phabricator.wikimedia.org/T210704) [15:42:40] (03CR) 10Elukey: "More hosts: https://puppet-compiler.wmflabs.org/compiler1001/14719/" [puppet] - 10https://gerrit.wikimedia.org/r/477475 (https://phabricator.wikimedia.org/T210704) (owner: 10Elukey) [15:45:05] (03CR) 10Elukey: [C: 03+2] service::node: add the 'use_nodejs10' parameter [puppet] - 10https://gerrit.wikimedia.org/r/477475 (https://phabricator.wikimedia.org/T210704) (owner: 10Elukey) [15:48:17] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843 (10elukey) Very promising: https://github.com/RadeonOpenCompute/ROCm/issues/702#issuecomment-461982554 > As noted in #691 and #640, Hawaii GPUs... [15:51:28] (03CR) 10Alexandros Kosiaris: profile::mediawiki::maintenance: systemd-timer based periodic jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/482792 (https://phabricator.wikimedia.org/T211250) (owner: 10Giuseppe Lavagetto) [15:52:23] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I 've left a comment on the wrong patchset (PS 1) but it still stands. 
-1 per that comment" [puppet] - 10https://gerrit.wikimedia.org/r/482792 (https://phabricator.wikimedia.org/T211250) (owner: 10Giuseppe Lavagetto) [15:54:04] (03PS1) 10Andrew Bogott: Revert "wmcs services: fixed a few resource paths for maintain_dbusers" [puppet] - 10https://gerrit.wikimedia.org/r/491267 [15:54:06] (03PS1) 10Andrew Bogott: Revert "wmcs services: introduce profile for maintain_dbusers in services nodes" [puppet] - 10https://gerrit.wikimedia.org/r/491268 [15:54:08] (03PS1) 10Andrew Bogott: Revert "nfs-exportd: add the 'exportdir' config value" [puppet] - 10https://gerrit.wikimedia.org/r/491269 [15:54:10] (03PS1) 10Andrew Bogott: Revert "NFS: allow the clouddb-services project to mount tools home and project dirs" [puppet] - 10https://gerrit.wikimedia.org/r/491270 [15:54:17] (03CR) 10Alexandros Kosiaris: WIP: Cron to run script to purge old CX drafts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/486454 (https://phabricator.wikimedia.org/T189091) (owner: 10KartikMistry) [15:54:21] (03CR) 10Andrew Bogott: "this can be abandoned, I think -- I'm going to just revert in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/491268/" [puppet] - 10https://gerrit.wikimedia.org/r/491189 (https://phabricator.wikimedia.org/T216373) (owner: 10Bstorm) [15:54:51] (03PS1) 10Bstorm: tools-bastion: need the checksum type [puppet] - 10https://gerrit.wikimedia.org/r/491271 (https://phabricator.wikimedia.org/T215586) [15:55:12] (03CR) 10Nuria: "Did we do a test run of this scoop?" 
[puppet] - 10https://gerrit.wikimedia.org/r/491246 (https://phabricator.wikimedia.org/T205940) (owner: 10Joal) [15:56:12] (03CR) 10Bstorm: [C: 03+2] tools-bastion: need the checksum type [puppet] - 10https://gerrit.wikimedia.org/r/491271 (https://phabricator.wikimedia.org/T215586) (owner: 10Bstorm) [16:00:36] (03PS2) 10Bstorm: maintain_dbusers: add the new database VM [puppet] - 10https://gerrit.wikimedia.org/r/491013 (https://phabricator.wikimedia.org/T193264) [16:01:55] (03CR) 10Andrew Bogott: [C: 03+1] maintain_dbusers: add the new database VM [puppet] - 10https://gerrit.wikimedia.org/r/491013 (https://phabricator.wikimedia.org/T193264) (owner: 10Bstorm) [16:03:20] 10Operations, 10vm-requests, 10Patch-For-Review: eqiad: (1) Ganeti VM for testing Kerberos in Production - https://phabricator.wikimedia.org/T216238 (10akosiaris) >>! In T216238#4960997, @elukey wrote: > So after reading https://wikitech.wikimedia.org/wiki/Ganeti#Create_a_VM this is what I'd do: > > 1) Revi... [16:06:12] (03CR) 10BryanDavis: [C: 03+1] maintain_dbusers: add the new database VM [puppet] - 10https://gerrit.wikimedia.org/r/491013 (https://phabricator.wikimedia.org/T193264) (owner: 10Bstorm) [16:07:53] 10Operations, 10Analytics, 10Research, 10serviceops, and 4 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10akosiaris) For the record, just saying pointing out that the question of a new VM versus mwmaint1002 is probably irrelevant here. We c... 
[16:08:59] (03PS2) 10Andrew Bogott: Revert "wmcs services: fixed a few resource paths for maintain_dbusers" [puppet] - 10https://gerrit.wikimedia.org/r/491267 [16:09:01] (03PS12) 10Mathew.onipe: Add wdqs data transfer cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) [16:09:08] (03PS2) 10Andrew Bogott: Revert "wmcs services: introduce profile for maintain_dbusers in services nodes" [puppet] - 10https://gerrit.wikimedia.org/r/491268 [16:09:13] (03CR) 10Mathew.onipe: Add wdqs data transfer cookbook (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) (owner: 10Mathew.onipe) [16:09:18] (03PS2) 10Andrew Bogott: Revert "nfs-exportd: add the 'exportdir' config value" [puppet] - 10https://gerrit.wikimedia.org/r/491269 [16:09:27] (03PS2) 10Andrew Bogott: Revert "NFS: allow the clouddb-services project to mount tools home and project dirs" [puppet] - 10https://gerrit.wikimedia.org/r/491270 [16:10:12] (03CR) 10Andrew Bogott: [C: 03+2] Revert "wmcs services: fixed a few resource paths for maintain_dbusers" [puppet] - 10https://gerrit.wikimedia.org/r/491267 (owner: 10Andrew Bogott) [16:10:27] (03CR) 10Andrew Bogott: [C: 03+2] Revert "wmcs services: introduce profile for maintain_dbusers in services nodes" [puppet] - 10https://gerrit.wikimedia.org/r/491268 (owner: 10Andrew Bogott) [16:10:36] (03CR) 10Andrew Bogott: [C: 03+2] Revert "nfs-exportd: add the 'exportdir' config value" [puppet] - 10https://gerrit.wikimedia.org/r/491269 (owner: 10Andrew Bogott) [16:10:47] (03CR) 10Andrew Bogott: [C: 03+2] Revert "NFS: allow the clouddb-services project to mount tools home and project dirs" [puppet] - 10https://gerrit.wikimedia.org/r/491270 (owner: 10Andrew Bogott) [16:11:32] (03CR) 10jerkins-bot: [V: 04-1] Add wdqs data transfer cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) (owner: 10Mathew.onipe) [16:17:08] 
(03PS13) 10Mathew.onipe: Add wdqs data transfer cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) [16:19:12] 10Operations, 10Core Platform Team Backlog (Later), 10Patch-For-Review, 10Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10bd808) [16:32:02] 10Operations, 10Analytics, 10Wikimedia-Stream, 10Services (watching): Eventstreams build is broken - https://phabricator.wikimedia.org/T216184 (10fdans) [16:32:19] (03PS1) 10Muehlenhoff: Remove stray packages after dist-upgrade on buster [puppet] - 10https://gerrit.wikimedia.org/r/491275 [16:34:37] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (10Cmjohnson) @elukey, it will affect how it's rack...10G racks have different switches but we are also limited in space for those racks. If 1... [16:36:44] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (10elukey) >>! In T215231#4961942, @Cmjohnson wrote: > @elukey, it will affect how it's rack...10G racks have different switches but we are also... 
[16:41:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] maintain_dbusers: add the new database VM [puppet] - 10https://gerrit.wikimedia.org/r/491013 (https://phabricator.wikimedia.org/T193264) (owner: 10Bstorm) [16:41:41] (03PS1) 10Elukey: Deployment-prep: add cassandra/twcs scap repository [puppet] - 10https://gerrit.wikimedia.org/r/491276 (https://phabricator.wikimedia.org/T210706) [16:41:46] 10Operations, 10Analytics, 10RESTBase, 10Traffic, and 2 others: Verify that hit/miss stats in WebRequest are correct - https://phabricator.wikimedia.org/T215987 (10fdans) [16:42:30] 10Operations, 10Analytics, 10RESTBase, 10Traffic, and 2 others: Verify that hit/miss stats in WebRequest are correct - https://phabricator.wikimedia.org/T215987 (10fdans) @BBlack do you have any concerns related to the hit/miss data sent to webrequest? [16:42:50] 10Operations, 10Analytics, 10RESTBase, 10Traffic, and 2 others: Verify that hit/miss stats in WebRequest are correct - https://phabricator.wikimedia.org/T215987 (10fdans) a:05JAllemandou→03None [16:52:48] 10Operations, 10Analytics, 10Analytics-Kanban, 10hardware-requests: GPU upgrade for stat1005 - https://phabricator.wikimedia.org/T216226 (10fdans) [16:52:55] 10Operations, 10Analytics, 10Analytics-Kanban, 10hardware-requests: GPU upgrade for stat1005 - https://phabricator.wikimedia.org/T216226 (10fdans) p:05Normal→03High [16:55:13] jouncebot: next [16:55:13] In 19 hour(s) and 4 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190219T1200) [16:55:40] 19 hours and 4 minutes :D [16:55:52] I should try one late in Thursday [16:56:03] I'm going to test something that will disrupt jouncebot for a short time. 
looks like that shouldn't cause anyone issues [16:58:55] 10Operations, 10Analytics, 10Analytics-Cluster, 10Traffic: Respect X-Forwarded-For only from trustworthy sources - https://phabricator.wikimedia.org/T56783 (10fdans) [16:59:36] 10Operations, 10Analytics, 10Analytics-Cluster, 10Traffic: Respect X-Forwarded-For only from trustworthy sources - https://phabricator.wikimedia.org/T56783 (10fdans) @BBlack is this task finished? [17:00:18] (03CR) 10Joal: "I manually tested the jobs with manually pasted parameters (parameter-names as defined in the files) - they work :)" [puppet] - 10https://gerrit.wikimedia.org/r/491246 (https://phabricator.wikimedia.org/T205940) (owner: 10Joal) [17:00:36] elukey: --^ :D [17:07:35] ah good! [17:20:34] 10Operations, 10Analytics, 10Research, 10serviceops, and 4 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10Nuria) >Should we btw stall this on T213976? yes, we need to resolve first where/how are binarie/data files s going to be moved to the... [17:21:00] (03PS1) 10Effie Mouzeli: Upgrade thumbor1004 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/491277 (https://phabricator.wikimedia.org/T214597) [17:22:36] (03CR) 10Volans: [C: 03+1] "LGTM if you can't do the refactor of execute_on_clusters() to reduce the parameters right now. Up to you." [cookbooks] - 10https://gerrit.wikimedia.org/r/491255 (https://phabricator.wikimedia.org/T202885) (owner: 10Gehel) [17:25:42] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.0.16 [software/spicerack] - 10https://gerrit.wikimedia.org/r/491278 [17:27:03] 10Operations, 10Analytics, 10Discovery, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10jcrespo) We use [[ https://phabricator.wikimedia.org/T156462 | transfer.py ]] to transfer up to 12TB of data for da... 
[17:28:03] (03CR) 10Effie Mouzeli: [C: 03+2] Upgrade thumbor1004 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/491277 (https://phabricator.wikimedia.org/T214597) (owner: 10Effie Mouzeli) [17:29:43] 10Operations, 10Analytics, 10Discovery, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10Nuria) @jcrespo: have in mind that this is not only for data destined to mysql (although this is the particular ca... [17:31:49] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.0.16 [software/spicerack] - 10https://gerrit.wikimedia.org/r/491278 (owner: 10Volans) [17:34:32] (03PS1) 10Volans: Add tox configuration to run the tests [dns] - 10https://gerrit.wikimedia.org/r/491280 [17:34:43] (03CR) 10jerkins-bot: [V: 04-1] Add tox configuration to run the tests [dns] - 10https://gerrit.wikimedia.org/r/491280 (owner: 10Volans) [17:35:45] 10Operations, 10Analytics, 10Discovery, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10jcrespo) transfer.py works for: * Plain files from filesystem to filesystem * Online mysql/mariaDB databases It w... [17:37:39] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.16 [software/spicerack] - 10https://gerrit.wikimedia.org/r/491278 (owner: 10Volans) [17:38:27] 10Operations, 10Analytics, 10Discovery, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10Nuria) @jcrespo seems something worth considering, I leave up to @fgiunchedi @Ottomata and @akosiaris to see if tra... 
[17:38:39] (03CR) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.16 [software/spicerack] - 10https://gerrit.wikimedia.org/r/491278 (owner: 10Volans) [17:40:24] (03PS1) 10Elukey: aqs: add the possibily to deploy nodejs 10 [puppet] - 10https://gerrit.wikimedia.org/r/491282 (https://phabricator.wikimedia.org/T210706) [17:41:19] (03CR) 10Volans: "I've given a try to the tox migration. I've removed the run-tests.sh script as I think the only usage is in CI, but let me know if you hav" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/491280 (owner: 10Volans) [17:42:57] (03PS1) 10Volans: Upstream release v0.0.16 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/491283 [17:44:27] (03CR) 10Volans: "@bblack: sorry, I failed to add you when adding my previous comment, please see:" [dns] - 10https://gerrit.wikimedia.org/r/491280 (owner: 10Volans) [17:49:35] (03CR) 10Volans: [C: 03+2] Upstream release v0.0.16 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/491283 (owner: 10Volans) [17:49:54] !log Reimaging thumbor1004 to stretc - T214597 [17:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:58] T214597: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 [17:49:59] !log Reimaging thumbor1004 to stretch - T214597 [17:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:04] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, 10User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['thumbor1004.eqiad.wmnet'] ` The log can be foun... 
[17:53:09] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/14722/aqs1004.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/491282 (https://phabricator.wikimedia.org/T210706) (owner: 10Elukey) [17:53:36] !log set clouddb1001 in read_only=1 [17:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:16] ^arturo, bstorm_ [17:54:20] 👋🏻 [17:54:35] o/ [17:54:40] (03CR) 10BBlack: "Can we break this up and get a matching commit for the integration-config repo that smooths it all over? By that I mean: (1) Just the mino" [dns] - 10https://gerrit.wikimedia.org/r/491280 (owner: 10Volans) [17:55:06] ok, so we log and set the original master in read only, then we do the dns deploy [17:55:17] then we set clouddb in read only=0 [17:55:21] check connections come in [17:55:22] (03Merged) 10jenkins-bot: Upstream release v0.0.16 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/491283 (owner: 10Volans) [17:55:32] well, the last 2 in reverse [17:56:05] !log restarting labsdb1004 [17:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:19] we need that caught up, and we will repoint it on the read only coord [17:56:41] sadly, there is an io error [17:56:58] so we may have a firewall/vlan problem [17:57:34] not a huge issue if it can connect to the new server [17:57:39] what is the connection? [17:57:49] labsdb1004 --> labsdb1005 ?
[17:57:52] We should be able to connect to the new server [17:58:01] arturo worked to get that opened already :) [17:58:01] the other way around [17:58:11] ok, we can deal with that later [17:58:16] ok [17:58:37] replicating from clouddb1001 may be preferred [17:58:55] we have worse ongoing issues [17:59:32] clouddb1001 should be able to connect to both on 3306/tcp, from a firewalling point of view [17:59:34] so I will set toolsdb in read_only when the dns change is prepared [17:59:48] Ok, so the announcement is out [17:59:48] yeah, but we need the opposite [18:00:00] The opposite? [18:00:01] the replica needs to connect to any master [18:00:11] Ah yes [18:00:57] It's weird that it cannot reach labsdb1005...wonder if a mistake happened in the network changes. Oh well, if it can connect to clouddb1001, that's where we want to end up eventually [18:01:05] Until we can get clouddb1002 up [18:01:33] arturo if you are ready with DNS, I think we should go ahead and go readonly now [18:01:35] wait a bit [18:01:39] there are some grant issues [18:01:40] Ok! [18:01:42] Ahhh [18:01:54] yeah... repl@'10.%' right?
[18:02:05] I changed that on the master...so it may be my fault [18:02:29] no, I am not worried about that right now [18:02:43] I did `UPDATE mysql.user SET Host='172.16.%' WHERE Host='10.%' AND user='repl';` [18:02:44] (03PS4) 10Arturo Borrero Gonzalez: cloudvps: refresh FQDN A record for tools.db.svc.eqiad.wmflabs [puppet] - 10https://gerrit.wikimedia.org/r/491005 (https://phabricator.wikimedia.org/T193264) [18:02:45] So you know [18:03:01] so try to avoid update [18:03:12] but if you, remember to do flush privileges [18:03:24] I didn't remember, but bd808 reminded me :) [18:03:30] update can have all sort of issues, specially compatibility between versions [18:03:37] Yeah [18:03:57] I did update the tables on clouddb1001 [18:04:03] Once I had it up [18:05:14] ok, ready to set things in read only when you tell me [18:05:45] (03CR) 10Bstorm: [C: 03+1] cloudvps: refresh FQDN A record for tools.db.svc.eqiad.wmflabs [puppet] - 10https://gerrit.wikimedia.org/r/491005 (https://phabricator.wikimedia.org/T193264) (owner: 10Arturo Borrero Gonzalez) [18:05:51] arturo: lgtm [18:05:57] cool [18:05:57] I say go ahead jynus. [18:05:58] technically we have a script for a switchover, but I don't want to try it on a non-production host [18:06:04] ok [18:06:08] yeah, that may go badly [18:07:21] oh, nice, even that command breaks [18:07:41] I'm ready for merge, when you are ready [18:07:41] ick [18:07:42] I'd say I kill it and configure it to start in read_only [18:07:52] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/491005/ [18:07:56] Yes, we've had that happen already several times [18:08:00] (merge and run the command) [18:08:40] !log disabled puppet and edited my.cnf on labsdb1005 [18:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:45] Thank you! 
[18:08:48] !log killing mysql on labsdb1005 [18:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:45] we can change the dns any time now [18:09:57] PROBLEM - exim queue on mx1001 is CRITICAL: CRITICAL: 3768 mails in exim queue. [18:10:00] it is running recovery [18:10:07] Cool [18:10:27] arturo: all set? [18:10:40] bstorm_: all set, on your command [18:10:48] Let's do it! [18:10:52] ok, doing it [18:11:05] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudvps: refresh FQDN A record for tools.db.svc.eqiad.wmflabs [puppet] - 10https://gerrit.wikimedia.org/r/491005 (https://phabricator.wikimedia.org/T193264) (owner: 10Arturo Borrero Gonzalez) [18:11:06] I already tested connections from toolforge so it *should* be good to go [18:11:11] so that should be log.246849:336 [18:11:15] on labsdb1005 [18:11:36] or log.165630:26478982 on clouddb1001 [18:11:52] Ok [18:12:11] things to remember for future replicas and coordination [18:12:25] jouncebot: next [18:12:25] In 17 hour(s) and 47 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190219T1200) [18:12:43] !log uploaded spicerack_0.0.16-1_amd64.deb to apt.wikimedia.org stretch-wikimedia [18:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:59] is dns deployed, not sure what is the status? [18:13:06] lemme see how it looks. [18:13:30] the script run already [18:13:30] no user on the new db [18:13:32] ok [18:13:36] so WIP? 
[18:13:54] let me check, it should be done [18:14:12] ask someone to run a command to connect to toolsdb [18:14:16] https://www.irccloud.com/pastebin/C1I5Dftc/ [18:14:25] !log upgraded to spicerack 0.0.16-1 cumin[12]001 [18:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:29] gehel: ^^^ [18:14:44] yeah, I expect some delay [18:14:54] DNS update is not quick :) [18:14:57] mmmm [18:15:02] Wonder what the TTL is [18:15:03] let me check raw designate [18:15:21] read_only | ON is on labsdb1005 [18:15:26] so that is good [18:15:27] lol [18:15:27] Cool [18:15:32] the script isn't working [18:15:40] Oh? That's not ideal [18:15:51] can it be updated manually? [18:16:07] The script adds things for the replicas reliably [18:16:11] I can also set the master back in rw [18:16:22] waiting for your feedback [18:16:40] I'm not too worried about setting it back yet because of the bad state of it anyway [18:16:47] checking some things [18:17:02] well, now it is the 20 minutes it "works" :-) [18:17:51] everthing is in read only, so anyrollback is easy [18:17:58] wait, I didn't run puppet first -_- [18:18:04] the difficult part was if there was a split brain [18:18:38] (03PS2) 10Volans: Add tox configuration to run the tests [dns] - 10https://gerrit.wikimedia.org/r/491280 [18:18:40] (03PS1) 10Volans: Removed run-tests.sh script [dns] - 10https://gerrit.wikimedia.org/r/491286 [18:18:49] (03CR) 10jerkins-bot: [V: 04-1] Removed run-tests.sh script [dns] - 10https://gerrit.wikimedia.org/r/491286 (owner: 10Volans) [18:19:22] jynus: bstorm_ is changed now [18:19:35] Yay! [18:19:39] now it's a matter of cache expiration [18:19:43] I see it [18:19:46] but the right A record is in the DNS [18:19:51] It's working in tools already [18:19:54] "ERROR 1045 (28000): Access denied for user 'bd808'@'172.16.7.167' (using password: NO)" [18:20:02] Not so yay! [18:20:04] ok, is that expected? 
[18:20:12] I was able to get in yesterday [18:20:13] are the grants on toolsdb locked to 10.*? [18:20:22] They shouldn't be no [18:20:25] They aren't [18:20:28] oh wait... using password NO? [18:20:40] * bd808 looks at the `sql` helper script [18:21:07] no, of course it is not 10., becaue 10. would be production [18:21:11] false alarm folks. I typed `mysql toolsdb` rather than `sql toolsdb` [18:21:18] bd808: `sql tools` right? [18:21:18] only 10. used to be replication [18:21:19] It's working [18:21:25] yes, it is working [18:21:30] YAY! [18:21:35] ok, if you are ok, even if dns is not applying to all [18:21:37] Whew. You scared me [18:21:42] I am requesting to set it in rw [18:21:50] Cool [18:21:52] that is almost a no-return thing [18:22:08] Should I dump out the non-replicated tables then? [18:22:15] wait [18:22:16] We could probably move them over [18:22:18] Ok [18:22:22] we will stop replication [18:22:26] and set it in rw [18:22:30] Ok [18:22:36] I will be doing that [18:22:40] so no 2 hands at the same time [18:22:41] Great [18:22:45] Definitely [18:22:51] ok, doing then now, so [18:23:13] I can see some accounts at least idling [18:23:44] remember log.246849:336 for the old master [18:24:02] log.165630:26478982 for the new one [18:24:17] Put that in a text file [18:24:29] I did that is [18:24:32] technically I already did with the public logging here :-) [18:24:39] True :-D [18:25:37] stop slave; reset slave all;set global replicate_wild_ignore_table=''; [18:26:00] !log setting clouddb1001 in read_write mode [18:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:39] writes are already going through [18:26:47] (03CR) 10Volans: "> Patch Set 1:" [dns] - 10https://gerrit.wikimedia.org/r/491280 (owner: 10Volans) [18:26:53] Great! 
[18:27:09] now, we can kill idle connections on the old server [18:27:18] assuming ttl expired [18:27:20] (03CR) 10Volans: "This depends on I4e57a749badceb8283c4d7d012a6a6b70d9e2002 that must be merged first and the new image deployed to CI." [dns] - 10https://gerrit.wikimedia.org/r/491286 (owner: 10Volans) [18:28:02] done [18:28:13] Cool [18:28:17] only 3 accounts reconnected [18:28:31] to the old one, I meant [18:28:36] Wonderful [18:28:56] but they are idle, most likely connection pools [18:29:11] Yeah, sounds like it [18:29:34] So does it seem reasonable to try a mysqldump on the non-replicated tables and try to find a way to move them over? [18:29:41] wait [18:29:48] as in, something to try, yes [18:30:11] but I would say to create a ticket- I know some of those are scratch data [18:30:21] Fair enough. [18:30:22] so they will prefer to recreate from 0 [18:30:31] That would certainly be nice [18:30:34] other may be difficult based on the data dictionary issues [18:30:48] Ok I'll make that ticket now. [18:30:50] so we can try, but we will have to do it on a case by case bases [18:31:04] note it was 4 accounts out of thousands [18:31:17] I guess the last thing then is trying to ensure that maintain-dbusers works with the new settings. Merging that [18:31:22] and they were warned we will not support redundancy on those, so a best effort will be done [18:31:28] So I'll do that first, then make that ticket [18:31:33] there is a high chance data will be corrupted [18:31:34] True [18:31:44] In fact, I believe I have a saved log that one is [18:32:40] (03PS3) 10Bstorm: maintain_dbusers: add the new database VM [puppet] - 10https://gerrit.wikimedia.org/r/491013 (https://phabricator.wikimedia.org/T193264) [18:32:49] s51412__data,s51071__templatetiger_p, s52721__pagecount_stats_p and s51290__dpl_p [18:33:21] So I'll try to get the maintain_dbusers whatnot set up and then it's all ensuring this is replicated and puppetized as the master (which we can do via the role). 
Should I apply that role to this vm now? [18:33:27] my main concern is if the issues were host based (hw or corruption) [18:33:33] or if they will reappear [18:33:44] Yes...that as well [18:33:54] we had HW issues [18:34:09] sure, I don't disagree [18:34:21] but I think it was mostly data dictionary issues [18:34:35] which of course they were related- a crash cause the second [18:34:46] also an older version [18:34:58] jynus, I'll apply this role to clouddb1001 now: modules/profile/manifests/wmcs/services/toolsdb_primary.pp [18:35:10] please do [18:35:10] well, the profile made out of the role [18:35:55] this will definitely need lots of tuning [18:36:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] maintain_dbusers: add the new database VM [puppet] - 10https://gerrit.wikimedia.org/r/491013 (https://phabricator.wikimedia.org/T193264) (owner: 10Bstorm) [18:36:27] but I would like to see people saying it works after an application restart [18:36:36] Same here [18:36:46] So...to the cloud channel, I guess [18:39:45] jynus: so far so good... [18:39:54] trying more things [18:39:56] as in, no complains, or people saying it works? [18:40:09] saying it is working for now. I can send an announcement [18:40:12] That will get more people trying [18:40:24] I won't have a lot of visibility sadly [18:40:34] due to lack of monitoring [18:41:42] buffer pool usage is still low [18:42:08] PROBLEM - Host stat1005 is DOWN: PING CRITICAL - Packet loss = 100% [18:42:19] this is me --^ [18:42:40] we will need to improve our monitoring workflow, but that is not critical right now [18:43:06] Yeah [18:43:26] should I try to conenct labsdb1004 to clouddb1001 directly? [18:43:32] We have a cloud prometheus thing https://grafana-labs.wikimedia.org/dashboard/db/labs-project-board?orgId=1&var-project=clouddb-services&var-server=clouddb1001 [18:43:34] RECOVERY - Host stat1005 is UP: PING OK - Packet loss = 0%, RTA = 36.21 ms [18:43:38] Please! 
[18:43:57] I think that would be the ideal until we get the virtualized secondary server up [18:44:25] I need some help [18:44:34] I am guessing you copied existing binlogs? [18:45:03] do you remember approximately the time you started the new instance [18:45:17] and the coordinate when shut down [18:45:55] Feb 17 21:11 UTC maybe? [18:45:57] bstorm_: ^ do you have !logs for that stuff? [18:46:23] Yesss... [18:46:28] One sec lemme look [18:46:29] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843 (10elukey) Tried to purge rocm-dev 2.1 and install 1.9.2, same problem: ` [ 90.690958] BUG: unable to handle kernel NULL pointer dereference at... [18:47:27] 2/17 21:21 bstorm_: The slave of labsdb1005.eqiad.wmnet is now clouddb1001.clouddb-services.eqiad.wmflabs [18:47:34] jynus: ^^ [18:47:47] That's from SAL, so a bit before I'm sure [18:47:49] cool [18:47:55] yeah, that fits [18:47:59] that helps me [18:48:03] Great [18:49:06] PROBLEM - Host stat1005 is DOWN: PING CRITICAL - Packet loss = 100% [18:50:03] oh, it is actually easy because they have cloned mysql binlogs [18:51:30] cool :) [18:51:52] now, we won't have dns on production... so I am guessing ip will be needed [18:51:58] RECOVERY - Host stat1005 is UP: PING OK - Packet loss = 0%, RTA = 36.16 ms [18:52:11] that is something to look at later [18:52:24] Yeah [18:52:28] * bstorm_ brb [18:55:10] so I will need a new account created from the production host [18:55:12] ip [18:56:00] 172.16.7.153 jynus ? [18:56:07] no, the old ip range [18:56:19] as brook already change that to the "right one" [18:56:29] I duplicated it for now [18:56:42] we should later do some auditing [18:56:50] I'm sure! [18:56:53] once everthing is on the same network [18:57:01] Yeah. 
[18:57:03] io is connected [18:57:10] So I can confirm that our tooling is working on the new server [18:57:13] I will start sql now, which will also be irreversible [18:57:27] with replication filters [18:57:56] replication flowing, Seconds_Behind_Master: 258756 [18:58:07] looks good so far [18:58:17] Great!!!! [18:58:31] I'll send an announce that things are working and to report issues. [18:58:51] mention those 4 dbs that are excluded [18:58:54] there is a ticket [18:59:16] Oh yes, ticket then announce! [18:59:33] jynus: you are great. Thanks for your help today. Also, marostegui earlier today. [18:59:46] I did nothing, you did [18:59:53] I mean, literally [19:02:12] Thank you very much!! [19:03:35] jynus: if nothing else you helped us feel confident that if we messed things up you could help us fix them ;) [19:08:16] (03PS1) 10GTirloni: toolschecker: Replace labsdb1005 with clouddb1001 [puppet] - 10https://gerrit.wikimedia.org/r/491290 (https://phabricator.wikimedia.org/T193264) [19:09:00] (03PS1) 10Dbarratt: Enable partial blocks on Meta Wiki and MediaWiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491291 (https://phabricator.wikimedia.org/T216065) [19:11:19] (03CR) 10GTirloni: [C: 03+2] toolschecker: Replace labsdb1005 with clouddb1001 [puppet] - 10https://gerrit.wikimedia.org/r/491290 (https://phabricator.wikimedia.org/T193264) (owner: 10GTirloni) [19:13:00] jynus marostegui: thanks a lot for all the help these past few days, you rock!
[19:14:34] #wikilove [19:17:12] <3 [19:30:59] (03PS1) 10Jcrespo: mariadb-backups: Fix bug when trying the default type [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/491293 (https://phabricator.wikimedia.org/T210292) [19:31:34] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backups: Fix bug when trying the default type [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/491293 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [19:33:11] legoktm - hi, we have a stuck global rename >3 hours, could you please take a look when you got a minute? tnx much [19:37:37] (03PS2) 10Jcrespo: mariadb-backups: Fix bug when trying the default type [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/491293 (https://phabricator.wikimedia.org/T210292) [19:38:02] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backups: Fix bug when trying the default type [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/491293 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [19:38:04] (03PS1) 10GTirloni: toolsdb: Enable monitoring [puppet] - 10https://gerrit.wikimedia.org/r/491294 (https://phabricator.wikimedia.org/T193264) [19:38:48] (03CR) 10jerkins-bot: [V: 04-1] toolsdb: Enable monitoring [puppet] - 10https://gerrit.wikimedia.org/r/491294 (https://phabricator.wikimedia.org/T193264) (owner: 10GTirloni) [19:41:54] (03PS1) 10Jcrespo: mariadb-backups: Fix bug when trying the default type [puppet] - 10https://gerrit.wikimedia.org/r/491295 (https://phabricator.wikimedia.org/T210292) [19:44:29] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] mariadb-backups: Fix bug when trying the default type [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/491293 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [19:44:39] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Fix bug when trying the default type [puppet] - 10https://gerrit.wikimedia.org/r/491295 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [19:48:27] 10Operations, 10Puppet, 
10Cloud-Services, 10cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (10Andrew) > #2 is almost certainly they way to go, as it avoids the weird chicken-egg issue of "we need a labs > puppetmaster to build... [19:54:40] (03PS1) 10GTirloni: toolsdb: Point tools-db.eqiad.wmflabs to clouddb1001 [puppet] - 10https://gerrit.wikimedia.org/r/491296 (https://phabricator.wikimedia.org/T193264) [19:54:59] (03PS1) 10DCausse: Plugins for elasticsearch 6.5.4 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/491297 (https://phabricator.wikimedia.org/T199791) [19:55:08] (03Abandoned) 10DCausse: [WIP] Add nori korean analyzer [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/486266 (https://phabricator.wikimedia.org/T206874) (owner: 10DCausse) [19:55:17] (03Abandoned) 10DCausse: [WIP] Upgrade to 6.5.4 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/446869 (https://phabricator.wikimedia.org/T199791) (owner: 10DCausse) [19:55:35] (03CR) 10Gehel: elasticsearch: add cookbook for rolling upgrade (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/491255 (https://phabricator.wikimedia.org/T202885) (owner: 10Gehel) [19:55:44] (03CR) 10Gehel: [C: 03+2] elasticsearch: add cookbook for rolling upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/491255 (https://phabricator.wikimedia.org/T202885) (owner: 10Gehel) [19:57:43] (03PS2) 10GTirloni: toolsdb: Enable monitoring [puppet] - 10https://gerrit.wikimedia.org/r/491294 (https://phabricator.wikimedia.org/T193264) [19:58:20] (03CR) 10GTirloni: [C: 03+2] toolsdb: Point tools-db.eqiad.wmflabs to clouddb1001 [puppet] - 10https://gerrit.wikimedia.org/r/491296 (https://phabricator.wikimedia.org/T193264) (owner: 10GTirloni) [19:58:30] (03PS2) 10DCausse: Plugins for elasticsearch 6.5.4 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/491297 (https://phabricator.wikimedia.org/T199791) 
[20:04:21] 10Operations, 10serviceops, 10Security: User ziraksima@gmail is receiving too many emails - https://phabricator.wikimedia.org/T216445 (10jijiki) [20:06:37] *sigh* anyway [20:11:59] ACKNOWLEDGEMENT - exim queue on mx1001 is CRITICAL: CRITICAL: 3229 mails in exim queue. Effie Mouzeli Most mails are towards the same user T216445 [20:15:42] (03PS3) 10GTirloni: toolsdb: Enable monitoring [puppet] - 10https://gerrit.wikimedia.org/r/491294 (https://phabricator.wikimedia.org/T193264) [20:16:36] (03CR) 10GTirloni: [C: 03+2] toolsdb: Enable monitoring [puppet] - 10https://gerrit.wikimedia.org/r/491294 (https://phabricator.wikimedia.org/T193264) (owner: 10GTirloni) [20:16:50] (03PS3) 10DCausse: Plugins for elasticsearch 6.5.4 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/491297 (https://phabricator.wikimedia.org/T199791) [20:19:26] !log icinga2001 ran puppet ahead of schedule (enable tools-checker-toolsdb monitor) [20:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:55] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, 10User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thumbor1004.eqiad.wmnet'] ` Of which those **FAILED**: ` ['thumbor1004.eqiad.wmnet'] ` [21:42:14] (03PS1) 10Framawiki: quarry: Setup CSP http header [puppet] - 10https://gerrit.wikimedia.org/r/491377 (https://phabricator.wikimedia.org/T214637) [21:43:48] (03CR) 10Zhuyifei1999: "I think it would be better to put in in server {} block." 
[puppet] - 10https://gerrit.wikimedia.org/r/491377 (https://phabricator.wikimedia.org/T214637) (owner: 10Framawiki) [21:45:20] (03CR) 10Framawiki: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/491377 (https://phabricator.wikimedia.org/T214637) (owner: 10Framawiki) [21:45:53] (03PS2) 10Framawiki: quarry: Setup CSP http header [puppet] - 10https://gerrit.wikimedia.org/r/491377 (https://phabricator.wikimedia.org/T214637) [21:47:11] PROBLEM - MariaDB read only staging on dbstore1005 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.1.37-MariaDB, Uptime 646582s, 10.31 QPS, connection latency: 0.001711s [21:53:14] (03CR) 10Zhuyifei1999: [C: 03+1] quarry: Setup CSP http header [puppet] - 10https://gerrit.wikimedia.org/r/491377 (https://phabricator.wikimedia.org/T214637) (owner: 10Framawiki) [22:11:31] RECOVERY - exim queue on mx1001 is OK: OK: Less than 1000 mails in exim queue. [22:14:20] (03PS3) 10Framawiki: quarry: Setup CSP http header [puppet] - 10https://gerrit.wikimedia.org/r/491377 (https://phabricator.wikimedia.org/T214637) [22:17:02] (03CR) 10Zhuyifei1999: [C: 03+1] quarry: Setup CSP http header [puppet] - 10https://gerrit.wikimedia.org/r/491377 (https://phabricator.wikimedia.org/T214637) (owner: 10Framawiki) [22:26:55] PROBLEM - Memory correctable errors -EDAC- on thumbor1004 is CRITICAL: 17 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [22:29:33] PROBLEM - puppet last run on thumbor1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 8 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[3d2png/deploy] [22:36:55] hi tgr - would you like to help me out with a stuck global rename? 
[22:37:19] hauskatze: yeah, I was getting around to that [22:37:28] ah, good [22:37:34] PROBLEM - MariaDB Slave Lag: s1 on db1083 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 743.91 seconds [22:38:16] tgr: one thing I don't like about the script is that it messes with the registration dates displayed on Special:CentralAuth, but I'm not sure if that happens with the --ignorestatus parameter only or in general [22:39:42] hauskatze: in general, the dates are stored in memory, if the script dies, they are lost. We have an open task about it somewhere. [22:40:51] I think I remember something, yes [22:41:00] * volans looking at the db alert [22:41:17] jynus: marostegui you around? [22:45:35] Amir1: running a script on mwmaint1002? [22:45:59] volans: no [22:46:08] is it like it? [22:46:17] I hope my account is not compromised [22:46:27] no, wrong amir ;) [22:46:45] I'm "ladsgroup" LDAP [22:46:48] PROBLEM - EDAC syslog messages on thumbor1004 is CRITICAL: 13 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [22:46:51] oh okay [22:47:24] volans: I can contact the other Amir [22:47:50] Amir1: thanks, already pinged him, thanks and sorry to bother [22:48:42] :) [22:48:45] we have 2 amirs on wmf? :) [22:49:27] One is WMDE and one WMF. We sorta have a channel to relay miscommunicated messages :D [22:49:59] RECOVERY - MariaDB Slave Lag: s1 on db1083 is OK: OK slave_sql_lag Replication lag: 0.21 seconds [22:50:15] I know you --WMDE-- and Amir E. A. [22:50:21] (It happens all the time) I was once told I played piano beautifully while I never touched a piano in my life. I was flattered though.
[22:50:54] lol [22:50:58] lol heh [22:54:25] I should post-it that mwmaint1001 no longer exists [22:54:33] I keep writing it wrong [23:42:15] (03CR) 10BryanDavis: [C: 04-1] "Unneeded after the I83868090cdc194c6f4f32088659f5c5d7eb52d94 revert" [puppet] - 10https://gerrit.wikimedia.org/r/491189 (https://phabricator.wikimedia.org/T216373) (owner: 10Bstorm)