[00:01:51] Operations, MobileFrontend, TechCom-RFC, Traffic, Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (Koavf) Sounds like a bug.
[00:04:06] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:15:29] (PS1) CDanis: grafana: also install sqlite3-pcre [puppet] - https://gerrit.wikimedia.org/r/520147
[00:17:08] (CR) CDanis: [C: +2] grafana: also install sqlite3-pcre [puppet] - https://gerrit.wikimedia.org/r/520147 (owner: CDanis)
[00:30:58] PROBLEM - puppet last run on labmon1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[sqlite3-pcre]
[00:32:44] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[sqlite3-pcre]
[00:43:46] PROBLEM - puppet last run on ms-be1042 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[00:47:42] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 875.97 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[01:11:02] RECOVERY - puppet last run on ms-be1042 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:21:41] !log milimetric@deploy1001 Started deploy [analytics/refinery@b8a496b]: fix private sqoop
[01:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:29:04] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:29:06] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=LIST https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:33:32] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro
[01:34:52] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:35:04] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:35:44] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=PUT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:35:46] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation={compareAndSwap,get} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:36:26] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[01:37:14] PROBLEM - Disk space on notebook1003 is CRITICAL: DISK CRITICAL - free space: /srv 4334 MB (3% inode=86%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[01:39:17] !log milimetric@deploy1001 Finished deploy [analytics/refinery@b8a496b]: fix private sqoop (duration: 17m 36s)
[01:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:40:10] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:40:12] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:40:48] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:40:58] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:42:18] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:42:20] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:44:20] PROBLEM - puppet last run on lvs5003 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[01:55:05] Operations, Wikimedia-Site-requests: Global rename of Waldir → Waldyrious: supervision needed - https://phabricator.wikimedia.org/T225370 (1997kB) @waldyrious You need to file another task for that. See T171417 for an example.
[02:07:12] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:11:38] RECOVERY - puppet last run on lvs5003 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[02:40:17] Operations, SRE-Access-Requests: restore shell access for dzahn - https://phabricator.wikimedia.org/T227052 (Dzahn)
[03:02:13] (PS3) Vgutierrez: acme_chief: Enforce staging time validation [software/acme-chief] - https://gerrit.wikimedia.org/r/517605 (https://phabricator.wikimedia.org/T225945)
[03:02:51] (CR) Vgutierrez: "Thanks for the review volans :)" (4 comments) [software/acme-chief] - https://gerrit.wikimedia.org/r/517605 (https://phabricator.wikimedia.org/T225945) (owner: Vgutierrez)
[03:20:32] (PS1) Dzahn: admins: restore shell access for myself, dzahn [puppet] - https://gerrit.wikimedia.org/r/520157 (https://phabricator.wikimedia.org/T227052)
[03:24:35] (CR) JJMC89: "Is there a reason some permissions are set in InitialiseSettings vs abusefilter? Should they be consolidated into one?" [mediawiki-config] - https://gerrit.wikimedia.org/r/475772 (owner: Daimona Eaytoy)
[03:39:26] (PS12) Vgutierrez: ncredir: Provide initial puppetization [puppet] - https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548)
[03:53:04] PROBLEM - snapshot of s6 in codfw on db1115 is CRITICAL: snapshot for s6 at codfw taken more than 4 days ago: Most recent backup 2019-06-28 03:33:36 https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[04:01:41] (PS13) Vgutierrez: ncredir: Provide initial puppetization [puppet] - https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548)
[04:02:06] (PS1) CDanis: lvs: alert on excessive network rx traffic [puppet] - https://gerrit.wikimedia.org/r/520160
[04:02:25] (CR) jerkins-bot: [V: -1] ncredir: Provide initial puppetization [puppet] - https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548) (owner: Vgutierrez)
[04:02:55] (CR) jerkins-bot: [V: -1] lvs: alert on excessive network rx traffic [puppet] - https://gerrit.wikimedia.org/r/520160 (owner: CDanis)
[04:03:34] (PS14) Vgutierrez: ncredir: Provide initial puppetization [puppet] - https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548)
[04:04:24] (PS2) CDanis: lvs: alert on excessive network rx traffic [puppet] - https://gerrit.wikimedia.org/r/520160
[04:05:13] (CR) jerkins-bot: [V: -1] lvs: alert on excessive network rx traffic [puppet] - https://gerrit.wikimedia.org/r/520160 (owner: CDanis)
[04:06:12] (PS3) CDanis: lvs: alert on excessive network rx traffic [puppet] - https://gerrit.wikimedia.org/r/520160
[04:11:58] (CR) CDanis: "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1002/17180/" [puppet] - https://gerrit.wikimedia.org/r/520160 (owner: CDanis)
[04:14:09] cdanis: I don't know how picky is XioNoX regarding units.. but maybe you could throw a /8 and use mbps instead of MBps
[04:14:26] I think pa.ravoid will probably ask me the same thing so I should do that
[04:14:49] O:)
[04:15:03] what?
[04:15:04] pre-emptive code review
[04:15:46] XioNoX: somehow I'm used to megabits per second instead of megabytes per second when I'm in "network mode"
[04:16:08] ah
[04:16:20] (PS4) CDanis: lvs: alert on excessive network rx traffic [puppet] - https://gerrit.wikimedia.org/r/520160
[04:31:38] (PS15) Vgutierrez: ncredir: Provide initial puppetization [puppet] - https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548)
[04:41:46] (CR) ArielGlenn: [C: +1] "verified that the key is different, and that this restores all group memberships removed in Iafa0c092a25ddb19b2e577879f1ae73c10317263" [puppet] - https://gerrit.wikimedia.org/r/520157 (https://phabricator.wikimedia.org/T227052) (owner: Dzahn)
[04:50:19] (PS16) Vgutierrez: ncredir: Provide initial puppetization [puppet] - https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548)
[04:50:45] (CR) jerkins-bot: [V: -1] ncredir: Provide initial puppetization [puppet] - https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548) (owner: Vgutierrez)
[04:52:22] (PS6) Vgutierrez: acme_chief: Introduce the concept of shared certificates [puppet] - https://gerrit.wikimedia.org/r/517660 (https://phabricator.wikimedia.org/T133548)
[04:52:25] (PS17) Vgutierrez: ncredir: Provide initial puppetization [puppet] - https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548)
[04:52:56] (CR) jerkins-bot: [V: -1] ncredir: Provide initial puppetization [puppet] - https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548) (owner: Vgutierrez)
[04:53:32] (PS18) Vgutierrez: ncredir: Provide initial puppetization [puppet] - https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548)
[04:53:38] PROBLEM - Check systemd state on ms-be1024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:09:49] (PS19) Vgutierrez: ncredir: Provide initial puppetization [puppet] - https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548)
[05:10:16] (CR) jerkins-bot: [V: -1] ncredir: Provide initial puppetization [puppet] - https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548) (owner: Vgutierrez)
[05:11:08] (PS20) Vgutierrez: ncredir: Provide initial puppetization [puppet] - https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548)
[05:19:14] (PS1) Marostegui: db-eqiad.php: Depool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/520167
[05:20:38] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/520167 (owner: Marostegui)
[05:21:54] (Merged) jenkins-bot: db-eqiad.php: Depool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/520167 (owner: Marostegui)
[05:22:13] (CR) jenkins-bot: db-eqiad.php: Depool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/520167 (owner: Marostegui)
[05:23:14] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1092 (duration: 00m 54s)
[05:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:27] !log Upgrade MySQL and kernel on db1092
[05:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:09] Operations, serviceops, Core Platform Team Backlog (Later), Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (KartikMistry)
[05:33:20] RECOVERY - Check systemd state on ms-be1024 is OK: OK - running: The system is fully operational
[05:34:46] (PS1) Marostegui: db-eqiad.php: Slowly repool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/520168
[05:35:53] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/520168 (owner: Marostegui)
[05:36:44] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/520168 (owner: Marostegui)
[05:37:00] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/520168 (owner: Marostegui)
[05:37:48] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1092 (duration: 00m 48s)
[05:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:07] (PS1) Marostegui: db-eqiad.php: Give some API traffic to db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/520170
[05:46:11] (CR) Marostegui: [C: +2] db-eqiad.php: Give some API traffic to db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/520170 (owner: Marostegui)
[05:47:04] (Merged) jenkins-bot: db-eqiad.php: Give some API traffic to db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/520170 (owner: Marostegui)
[05:47:19] (CR) jenkins-bot: db-eqiad.php: Give some API traffic to db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/520170 (owner: Marostegui)
[05:48:04] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1092 into API (duration: 00m 49s)
[05:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:52:26] PROBLEM - snapshot of s3 in codfw on db1115 is CRITICAL: snapshot for s3 at codfw taken more than 4 days ago: Most recent backup 2019-06-28 05:31:32 https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[06:01:40] (PS1) Aaron Schulz: Raise HHVM mysql query time threshold to effectively not trigger [puppet] - https://gerrit.wikimedia.org/r/520172 (https://phabricator.wikimedia.org/T216243)
[06:03:45] (CR) Giuseppe Lavagetto: [C: +2] systemd::timer: use OS facts in tests [puppet] - https://gerrit.wikimedia.org/r/520034 (owner: Giuseppe Lavagetto)
[06:05:39] (CR) Giuseppe Lavagetto: furl: support connecting to unix sockets (1 comment) [puppet] - https://gerrit.wikimedia.org/r/520025 (owner: Giuseppe Lavagetto)
[06:11:39] (PS3) Giuseppe Lavagetto: furl: support connecting to unix sockets [puppet] - https://gerrit.wikimedia.org/r/520025
[06:11:41] (PS2) Giuseppe Lavagetto: systemd::timer: use OS facts in tests [puppet] - https://gerrit.wikimedia.org/r/520034
[06:13:39] (CR) Giuseppe Lavagetto: [C: +2] furl: support connecting to unix sockets [puppet] - https://gerrit.wikimedia.org/r/520025 (owner: Giuseppe Lavagetto)
[06:24:55] (PS2) Giuseppe Lavagetto: parsoid: use safe service restarts [puppet] - https://gerrit.wikimedia.org/r/518669
[06:26:03] (PS2) Elukey: Introduce an-tool1006 [puppet] - https://gerrit.wikimedia.org/r/520055 (https://phabricator.wikimedia.org/T226844)
[06:26:23] (CR) Volans: [C: +1] "LGTM, just one final question inline." (2 comments) [software/acme-chief] - https://gerrit.wikimedia.org/r/517605 (https://phabricator.wikimedia.org/T225945) (owner: Vgutierrez)
[06:27:21] (CR) Elukey: [C: +2] Introduce an-tool1006 [puppet] - https://gerrit.wikimedia.org/r/520055 (https://phabricator.wikimedia.org/T226844) (owner: Elukey)
[06:30:24] PROBLEM - puppet last run on dbproxy1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled]
[06:31:20] (CR) Giuseppe Lavagetto: [C: -1] "We still need depool-parsoid and pool-parsoid at the moment." [puppet] - https://gerrit.wikimedia.org/r/518669 (owner: Giuseppe Lavagetto)
[06:31:50] * volans forcing puppet run on dbproxy1003
[06:32:12] PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check_ferm]
[06:35:32] * volans is not used to be around at cron.daily runs... :-P
[06:35:49] RECOVERY - puppet last run on dbproxy1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:41:18] Operations, MediaWiki-Maintenance-scripts, Performance-Team, Patch-For-Review: cron spam for slow queries on mwmaint /usr/local/bin/foreachwiki initSiteStats.php --update > /dev/null - https://phabricator.wikimedia.org/T216243 (Joe) I think we can just let this be resolved by the (upcoming) trans...
[06:41:18] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:44:02] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:47:32] cr4-ulsfo is the zayo maintenance for which we just got a reminder mail
[06:48:26] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:48:40] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:49:12] https://upload.wikimedia.org/wikipedia/commons/thumb/9/9f/Prag%2C_Nationalmuseum%2C_Brunnen_--_2019_--_6841.jpg/3000px-Prag%2C_Nationalmuseum%2C_Brunnen_--_2019_--_6841.jpg
[06:49:26] Request from 223.191.34.26 via cp5006 frontend, Varnish XID 331858364
[06:49:26] Upstream caches: cp5006 int
[06:49:26] Error: 500, Internal Server Error at Tue, 02 Jul 2019 06:49:04 GMT
[06:50:00] (PS1) Tulsi Bhagat: Configuring Namespaces at pawikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/520174 (https://phabricator.wikimedia.org/T226959)
[06:56:47] (CR) Tulsi Bhagat: "Requires `mwscript namespaceDupes.php --wiki=pawikisource --fix` after deployment." [mediawiki-config] - https://gerrit.wikimedia.org/r/520174 (https://phabricator.wikimedia.org/T226959) (owner: Tulsi Bhagat)
[06:59:30] RECOVERY - puppet last run on mw1307 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:01:56] (PS1) Marostegui: db-eqiad.php: More traffic to db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/520175
[07:02:21] (CR) MarcoAurelio: [C: -1] Configuring Namespaces at pawikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/520174 (https://phabricator.wikimedia.org/T226959) (owner: Tulsi Bhagat)
[07:03:12] (CR) Marostegui: [C: +2] db-eqiad.php: More traffic to db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/520175 (owner: Marostegui)
[07:03:20] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 6 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting%23Nova-fullstack
[07:03:44] (CR) MarcoAurelio: [C: -1] Configuring Namespaces at pawikisource (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/520174 (https://phabricator.wikimedia.org/T226959) (owner: Tulsi Bhagat)
[07:04:02] (Merged) jenkins-bot: db-eqiad.php: More traffic to db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/520175 (owner: Marostegui)
[07:04:19] (CR) jenkins-bot: db-eqiad.php: More traffic to db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/520175 (owner: Marostegui)
[07:04:23] looking
[07:05:33] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1092 (duration: 00m 49s)
[07:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:28] (CR) MarcoAurelio: [C: -1] Configuring Namespaces at pawikisource (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/520174 (https://phabricator.wikimedia.org/T226959) (owner: Tulsi Bhagat)
[07:08:21] (CR) MarcoAurelio: [C: -1] Configuring Namespaces at pawikisource (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/520174 (https://phabricator.wikimedia.org/T226959) (owner: Tulsi Bhagat)
[07:08:37] Operations, Analytics, Analytics-Kanban, vm-requests, and 2 others: Create an-tool1006, a ganeti vm to be used as client for the Hadoop test cluster - https://phabricator.wikimedia.org/T226844 (elukey) Open→Resolved
[07:08:40] Operations, Diffusion, Packaging, Patch-For-Review, and 4 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[07:08:45] Operations, LDAP-Access-Requests: Grant WMDE engineers access to logstash and creating grafana boards / Add WMDE engineers to 'nda' LDAP group - https://phabricator.wikimedia.org/T225004 (MoritzMuehlenhoff) >>! In T225004#5296760, @WMDE-leszek wrote: > Re being in WMDE group being equivalent to being in...
[07:12:11] (PS1) DCausse: [cirrus] Enable UTR30 as a lookup method for ns prefixes on group0 [mediawiki-config] - https://gerrit.wikimedia.org/r/520176
[07:13:52] Operations, WMDE-QWERTY-Team, serviceops, wikidiff2, and 3 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (jijiki) a:Joe→jijiki
[07:15:03] Operations, WMDE-QWERTY-Team, serviceops, wikidiff2, and 3 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (jijiki) @awight @jkroll should we rollout all hosts?
[07:15:56] (PS1) Marostegui: db-eqiad.php: Fully repool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/520177
[07:16:09] (PS2) DCausse: [cirrus] Enable UTR30 as a lookup method for ns prefixes on group1 [mediawiki-config] - https://gerrit.wikimedia.org/r/520176
[07:16:52] (CR) Marostegui: [C: +2] db-eqiad.php: Fully repool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/520177 (owner: Marostegui)
[07:17:44] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/520177 (owner: Marostegui)
[07:17:59] (CR) jenkins-bot: db-eqiad.php: Fully repool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/520177 (owner: Marostegui)
[07:20:24] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1092 (duration: 00m 49s)
[07:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:35] (PS1) Elukey: Assign role::analytics_test_cluster::client to an-tool1006 [puppet] - https://gerrit.wikimedia.org/r/520178 (https://phabricator.wikimedia.org/T226698)
[07:24:14] (CR) Elukey: [C: +2] Assign role::analytics_test_cluster::client to an-tool1006 [puppet] - https://gerrit.wikimedia.org/r/520178 (https://phabricator.wikimedia.org/T226698) (owner: Elukey)
[07:30:27] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 2 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting%23Nova-fullstack
[07:31:02] (PS2) Aaron Schulz: Update my obsolete YubiKey-stored SSH keys [puppet] - https://gerrit.wikimedia.org/r/519941
[07:39:16] (PS1) Elukey: role::analytics_test_cluster::client: add kerberos config [puppet] - https://gerrit.wikimedia.org/r/520182 (https://phabricator.wikimedia.org/T226698)
[07:41:40] (CR) Elukey: [C: +2] role::analytics_test_cluster::client: add kerberos config [puppet] - https://gerrit.wikimedia.org/r/520182 (https://phabricator.wikimedia.org/T226698) (owner: Elukey)
[07:43:37] (CR) Marostegui: "> > Patch Set 10:" [puppet] - https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) (owner: Jcrespo)
[07:47:55] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[07:47:56] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:59] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[07:48:00] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:17] (CR) DCausse: [C: +1] "lgtm" [mediawiki-config] - https://gerrit.wikimedia.org/r/520078 (https://phabricator.wikimedia.org/T221916) (owner: Smalyshev)
[07:48:53] !log draining restbase2019 for eventual reboot for MDS security updates / OpenJDK security update
[07:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:14] (PS1) Ema: cache: reimage cp2018 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/520183 (https://phabricator.wikimedia.org/T226637)
[07:52:52] (CR) Arturo Borrero Gonzalez: [C: +2] DNS: Add mgmt and production DNS for cloudbackup200[1-2] [dns] - https://gerrit.wikimedia.org/r/520137 (owner: Papaul)
[07:53:29] (CR) Vgutierrez: [C: +1] cache: reimage cp2018 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/520183 (https://phabricator.wikimedia.org/T226637) (owner: Ema)
[07:55:53] !log depool cp2018 and reimage as upload_ats T226637
[07:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:00] T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637
[07:57:21] (PS2) Ema: cache: reimage cp2018 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/520183 (https://phabricator.wikimedia.org/T226637)
[07:58:42] !log draining restbase2020 for eventual reboot for MDS security updates / OpenJDK security update
[07:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:48] (CR) Ema: [C: +2] cache: reimage cp2018 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/520183 (https://phabricator.wikimedia.org/T226637) (owner: Ema)
[08:02:30] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2018.codfw.wmnet'] ` The log can be found in `...
[08:04:34] Operations, Wikimedia-Site-requests: Global rename of Fiona B. → Fiona*: supervision needed - https://phabricator.wikimedia.org/T224348 (jcrespo) Stalled→Open @Itti, there should be no reason to block this anymore, as far as I been told and I can see, renames should be (almost) instant and not lo...
[08:05:17] (CR) Volans: [C: -1] "Cumin is a general purpose software, it should not have WMF-specific stuff, see inline for the details." (5 comments) [software/cumin] - https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) (owner: CRusnov)
[08:10:23] PROBLEM - Disk space on an-tool1006 is CRITICAL: DISK CRITICAL - /mnt/hdfs is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[08:10:23] !log restbase spare hosts, mask and stop restbase - T227054
[08:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:28] T227054: spare restbase servers sending traffic to non listening port - https://phabricator.wikimedia.org/T227054
[08:13:47] Operations, Wikimedia-Site-requests: Global rename of Waldir → Waldyrious: supervision needed - https://phabricator.wikimedia.org/T225370 (jcrespo) I am not a developer, but to me T225370#5298483 would seem like an intended thing. You may want to document that on the rename user documentation. Wikitech,...
[08:19:39] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[08:19:41] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:34] Operations, DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui)
[08:20:52] !log draining restbase1016 for eventual reboot for MDS security updates / OpenJDK security update
[08:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:02] Operations, observability, serviceops: Gather metrics on request status codes, latencies from the MediaWiki appservers - https://phabricator.wikimedia.org/T226815 (fgiunchedi) Indeed I think mtail to extract metrics from apache logs is the best way we have, I'm not aware of apache exposing more of it...
[08:23:13] Operations, Traffic, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (fgiunchedi)
[08:23:21] Operations, Operations-Software-Development, serviceops-radar, Patch-For-Review, and 3 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (elukey) @Volans @crusnov I created https://wikitech.wikimedia.org/wiki/Ganeti#Create_the_VM_(using_the_cookbook_sre.gane...
[08:25:35] elukey: you comment first and edit after... I've a conflict now :-P
[08:25:55] (PS2) Tulsi Bhagat: Configuring Namespaces at pawikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/520174 (https://phabricator.wikimedia.org/T226959)
[08:26:43] (PS1) Ema: tlsproxy: prometheus endpoint logging configuration [puppet] - https://gerrit.wikimedia.org/r/520186
[08:29:05] PROBLEM - HHVM jobrunner on mw1295 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[08:29:49] (CR) Giuseppe Lavagetto: [C: -1] "depool-eventbus and pool-eventbus are still needed." [puppet] - https://gerrit.wikimedia.org/r/518670 (owner: Giuseppe Lavagetto)
[08:29:51] (CR) Vgutierrez: [C: +1] tlsproxy: prometheus endpoint logging configuration [puppet] - https://gerrit.wikimedia.org/r/520186 (owner: Ema)
[08:29:55] PROBLEM - Nginx local proxy to jobrunner on mw1295 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[08:30:08] (PS1) Filippo Giunchedi: varnish: remove varnishstatsd [puppet] - https://gerrit.wikimedia.org/r/520187 (https://phabricator.wikimedia.org/T184942)
[08:30:13] RECOVERY - HHVM jobrunner on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[08:30:41] (CR) Ema: [C: +2] tlsproxy: prometheus endpoint logging configuration [puppet] - https://gerrit.wikimedia.org/r/520186 (owner: Ema)
[08:31:01] RECOVERY - Nginx local proxy to jobrunner on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 288 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[08:31:12] (CR) Filippo Giunchedi: [C: -1] "Can't be merged yet until all dashboards have been migrated, should be good to go though" [puppet] - https://gerrit.wikimedia.org/r/520187 (https://phabricator.wikimedia.org/T184942) (owner: Filippo Giunchedi)
[08:33:58] Operations, Cloud-VPS, LDAP, cloud-services-team (Kanban): investigate slapd memory leak - https://phabricator.wikimedia.org/T130593 (aborrero)
[08:34:06] Operations, Cloud-VPS, Toolforge, LDAP, and 2 others: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (aborrero) Open→Resolved a:aborrero It seems our long term plan is to introduce sssd to all VMs in CloudVPS....
[08:34:57] Operations, WMDE-QWERTY-Team, serviceops, wikidiff2, and 3 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (Joe) >>! In T223391#5299032, @jijiki wrote: > @awight @jkroll should we rollout to all hosts? I think we should, I didn't ge...
[08:36:26] !log draining restbase1017 for eventual reboot for MDS security updates / OpenJDK security update
[08:36:29] Operations, WMDE-QWERTY-Team, serviceops, wikidiff2, and 3 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (WMDE-Fisch) >>! In T223391#5299264, @Joe wrote: >>>! In T223391#5299032, @jijiki wrote: >> @awight @jkroll should we rollout...
[08:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:43] (PS5) Filippo Giunchedi: logstash: add consumer for client errors [puppet] - https://gerrit.wikimedia.org/r/519603 (https://phabricator.wikimedia.org/T217142)
[08:40:41] Operations, Traffic: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2018.codfw.wmnet'] ` and were **ALL** successful.
[08:49:38] 10Operations, 10Analytics, 10Wikimedia-Incident: Move icinga alarm for the EventStreams external endpoint to SRE - https://phabricator.wikimedia.org/T227065 (10elukey) [08:50:05] !log pool cp2018 w/ ATS backend T226637 [08:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:10] T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 [08:50:19] (03CR) 10Filippo Giunchedi: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/519603 (https://phabricator.wikimedia.org/T217142) (owner: 10Filippo Giunchedi) [08:51:58] (03PS1) 10Ema: cache: reimage cp2020 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520190 (https://phabricator.wikimedia.org/T226637) [08:52:10] !log draining restbase1018 for eventual reboot for MDS security updates / OpenJDK security update [08:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:21] 10Operations, 10Analytics, 10Wikimedia-Incident: Move icinga alarm for the EventStreams external endpoint to SRE - https://phabricator.wikimedia.org/T227065 (10elukey) [08:52:56] (03CR) 10Vgutierrez: [C: 03+1] cache: reimage cp2020 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520190 (https://phabricator.wikimedia.org/T226637) (owner: 10Ema) [08:55:04] !log depool cp2020 and reimage as upload_ats T226637 [08:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:28] (03CR) 10Ema: [C: 03+2] cache: reimage cp2020 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520190 (https://phabricator.wikimedia.org/T226637) (owner: 10Ema) [08:55:59] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [08:56:00] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:06] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [08:56:06] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:08] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:52] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2020.codfw.wmnet'] ` The log can be found in `... [09:00:54] !log draining restbase1019 for eventual reboot for MDS security updates / OpenJDK security update [09:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:36] (03PS3) 10Fsero: bug: systemd.timers does not support random on jessie [puppet] - 10https://gerrit.wikimedia.org/r/520018 [09:03:22] 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP, and 2 others: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (10hashar) +1 on resolution. I haven't encounter any time out since the sprint of action in March 2019. Thank you! 
[09:03:32] (03CR) 10jerkins-bot: [V: 04-1] bug: systemd.timers does not support random on jessie [puppet] - 10https://gerrit.wikimedia.org/r/520018 (owner: 10Fsero) [09:05:19] (03PS4) 10Fsero: bug: systemd.timers does not support random on jessie [puppet] - 10https://gerrit.wikimedia.org/r/520018 [09:06:10] (03CR) 10jerkins-bot: [V: 04-1] bug: systemd.timers does not support random on jessie [puppet] - 10https://gerrit.wikimedia.org/r/520018 (owner: 10Fsero) [09:07:35] (03PS3) 10Jcrespo: mariadb: Prepare core for buster [puppet] - 10https://gerrit.wikimedia.org/r/519073 (https://phabricator.wikimedia.org/T193224) [09:07:37] (03PS11) 10Jcrespo: prometheus-mysqld-exporter: Automate targets based on zarcillo db [puppet] - 10https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) [09:10:13] (03PS5) 10Fsero: systemd.timers: bug: does not support random on jessie [puppet] - 10https://gerrit.wikimedia.org/r/520018 [09:11:21] !log draining restbase1020 for eventual reboot for MDS security updates / OpenJDK security update [09:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:47] (03CR) 10Marostegui: "Should we also add a .sql file with the grants the user will have? Same way as we do with some other hosts (like misc) just for tracking t" [puppet] - 10https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) (owner: 10Jcrespo) [09:12:34] (03PS3) 10Marostegui: mariadb: Promote db1120 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/519185 (https://phabricator.wikimedia.org/T226358) [09:12:51] (03PS4) 10Marostegui: wmnet: Change x1-master to point to the new master [dns] - 10https://gerrit.wikimedia.org/r/519186 (https://phabricator.wikimedia.org/T226358) [09:12:57] (03PS3) 10Marostegui: db-eqiad.php: Promote db1120 to x1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519187 (https://phabricator.wikimedia.org/T226358) [09:14:25] (03CR) 10Ema: [C: 03+1] "LGTM! 
Some minor comments inline." (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [09:17:20] (03PS1) 10Elukey: role::analytics_test_cluster::client: add spark2 config [puppet] - 10https://gerrit.wikimedia.org/r/520195 (https://phabricator.wikimedia.org/T226698) [09:17:39] (03PS1) 10ArielGlenn: set dump cron start dates and times back to normal [puppet] - 10https://gerrit.wikimedia.org/r/520196 [09:18:34] (03CR) 10Elukey: [C: 03+2] role::analytics_test_cluster::client: add spark2 config [puppet] - 10https://gerrit.wikimedia.org/r/520195 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey) [09:20:47] (03PS2) 10ArielGlenn: set dump cron start dates and times back to normal [puppet] - 10https://gerrit.wikimedia.org/r/520196 [09:21:49] (03CR) 10ArielGlenn: [C: 03+2] set dump cron start dates and times back to normal [puppet] - 10https://gerrit.wikimedia.org/r/520196 (owner: 10ArielGlenn) [09:22:34] !log draining restbase1021 for eventual reboot for MDS security updates / OpenJDK security update [09:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:59] (03CR) 10Fsero: "PCC seems happy" [puppet] - 10https://gerrit.wikimedia.org/r/520018 (owner: 10Fsero) [09:28:18] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [09:28:20] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:13] !log draining restbase1022 for eventual reboot for MDS security updates / OpenJDK security update [09:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:15] (03CR) 10Fsero: "fleet wide PCC https://puppet-compiler.wmflabs.org/compiler1002/17184/ (still in progress)" [puppet] - 10https://gerrit.wikimedia.org/r/520018 (owner: 10Fsero) 
[09:30:17] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/520018 (owner: 10Fsero) [09:31:19] (03PS6) 10Fsero: systemd.timers: bug: does not support random on jessie [puppet] - 10https://gerrit.wikimedia.org/r/520018 [09:32:17] (03PS1) 10Elukey: role::analytics_test_cluster::client: fix coordinator hostname [puppet] - 10https://gerrit.wikimedia.org/r/520198 [09:32:25] (03CR) 10Fsero: [C: 03+2] systemd.timers: bug: does not support random on jessie [puppet] - 10https://gerrit.wikimedia.org/r/520018 (owner: 10Fsero) [09:32:44] (03PS2) 10Elukey: role::analytics_test_cluster::client: fix coordinator hostname [puppet] - 10https://gerrit.wikimedia.org/r/520198 [09:34:01] (03CR) 10Elukey: [C: 03+2] role::analytics_test_cluster::client: fix coordinator hostname [puppet] - 10https://gerrit.wikimedia.org/r/520198 (owner: 10Elukey) [09:34:30] !log Upgrade mysql on 2080 db2081 db2083 - T227062 [09:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:35] T227062: Failover s8 (wikidatawiki) db primary master db1071 to db1104 (read-only required) - https://phabricator.wikimedia.org/T227062 [09:37:40] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2020.codfw.wmnet'] ` and were **ALL** successful. 
[09:38:24] (03CR) 10Hoo man: "Adding an isset check would be more robust, I guess… but also more clutter :S" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518239 (https://phabricator.wikimedia.org/T225212) (owner: 10Lucas Werkmeister (WMDE)) [09:39:07] !log draining restbase1023 for eventual reboot for MDS security updates / OpenJDK security update [09:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:11] !log Upgrade db2094 (codfw sanitarium) T227062 [09:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:57] (03PS1) 10Fsero: helmfile: bug: contint* timers require full command path [puppet] - 10https://gerrit.wikimedia.org/r/520199 [09:44:18] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/519941 (owner: 10Aaron Schulz) [09:44:21] (03CR) 10Fsero: "PCC looks good https://puppet-compiler.wmflabs.org/compiler1002/17186/contint1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/520199 (owner: 10Fsero) [09:44:30] (03CR) 10Fsero: [C: 03+2] helmfile: bug: contint* timers require full command path [puppet] - 10https://gerrit.wikimedia.org/r/520199 (owner: 10Fsero) [09:44:50] 10Operations, 10Operations-Software-Development, 10serviceops-radar, 10Patch-For-Review, and 3 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (10Volans) Thanks a lot @elukey. I've just added a small detail and formatted hosts and paths. For the status of the task,... 
[09:46:50] !log draining restbase1024 for eventual reboot for MDS security updates / OpenJDK security update [09:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:30] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [09:48:32] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:00] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:50:00] !log rebooting secondary lvs servers for MDS security updates [09:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:08] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:52:07] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/520157 (https://phabricator.wikimedia.org/T227052) (owner: 10Dzahn) [09:53:28] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [09:53:28] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:08] RECOVERY - puppet last run on deploy2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:55:21] !log draining restbase1025 for eventual reboot for MDS security updates / OpenJDK security update [09:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:28] 10Operations, 10Operations-Software-Development, 10serviceops-radar, 10Patch-For-Review, and 3 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (10akosiaris) >>! 
In T203963#5297888, @Volans wrote: > @Dzahn the hardcoded MAC addesses will soon not be needed anymore !log restart rsyslog on wezen - T199406 [09:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:45] T199406: rsyslog's in:imtcp thread stuck on old sockets - https://phabricator.wikimedia.org/T199406 [10:00:14] RECOVERY - puppet last run on deploy1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:00:48] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [10:00:48] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:52] 10Operations, 10MediaWiki-Logging, 10Wikimedia-Logstash, 10serviceops, and 8 others: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147 (10Joe) Talking with @Krinkle I realized that the reason why I saw that ugly error message is because the endpoint doesn't initiali... 
[10:02:18] !log pool cp2020 w/ ATS backend T226637 [10:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:24] T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 [10:05:19] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [10:05:20] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:58] !log powercycle analytics1056 (soft lockups logged in the serial console, no ssh, no net connectivity) [10:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:38] RECOVERY - Host analytics1056 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [10:10:38] (03CR) 10Matthias Geisler: Enable DataBridge on Beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519987 (https://phabricator.wikimedia.org/T226816) (owner: 10Matthias Geisler) [10:10:42] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [10:13:47] ACKNOWLEDGEMENT - snapshot of s3 in codfw on db1115 is CRITICAL: snapshot for s3 at codfw taken more than 4 days ago: Most recent backup 2019-06-28 05:31:32 Jcrespo retrying s3 and s6 snapshots on codfw - The acknowledgement expires at: 2019-07-03 14:13:01. https://wikitech.wikimedia.org/wiki/MariaDB/Backups [10:13:47] ACKNOWLEDGEMENT - snapshot of s6 in codfw on db1115 is CRITICAL: snapshot for s6 at codfw taken more than 4 days ago: Most recent backup 2019-06-28 03:33:36 Jcrespo retrying s3 and s6 snapshots on codfw - The acknowledgement expires at: 2019-07-03 14:13:01. 
https://wikitech.wikimedia.org/wiki/MariaDB/Backups [10:15:07] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [10:15:08] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:20] !log draining restbase1026 for eventual reboot for MDS security updates / OpenJDK security update [10:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:00] !log draining restbase1027 for eventual reboot for MDS security updates / OpenJDK security update [10:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:35] (03PS1) 10Filippo Giunchedi: (WIP) toil: add rsyslog TLS remedy [puppet] - 10https://gerrit.wikimedia.org/r/520207 (https://phabricator.wikimedia.org/T199406) [10:23:19] (03CR) 10jerkins-bot: [V: 04-1] (WIP) toil: add rsyslog TLS remedy [puppet] - 10https://gerrit.wikimedia.org/r/520207 (https://phabricator.wikimedia.org/T199406) (owner: 10Filippo Giunchedi) [10:26:24] 10Operations, 10Operations-Software-Development, 10serviceops-radar, 10Patch-For-Review, and 3 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (10Volans) @akosiaris the "plan" was partially explained as part of the bare metal/host provisioning breakout session at th... [10:27:40] (03PS1) 10Jbond: labs/private: update some keys [labs/private] - 10https://gerrit.wikimedia.org/r/520208 [10:27:52] 10Operations, 10Operations-Software-Development, 10serviceops-radar, 10Patch-For-Review, and 3 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (10akosiaris) >>! In T203963#5299546, @Volans wrote: > @akosiaris the "plan" was partially explained as part of the bare me... 
[10:30:46] (03Abandoned) 10Jbond: labs/private: update some keys [labs/private] - 10https://gerrit.wikimedia.org/r/520208 (owner: 10Jbond) [10:32:17] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [10:32:17] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:21] 10Operations, 10Datasets-Archiving, 10Datasets-General-or-Unknown, 10Dumps-Generation: Image tarball dumps on your.org are not being generated - https://phabricator.wikimedia.org/T53001 (10ArielGlenn) Some notes on architecture of the media sync scripts above: - The plan is that these would only every ru... [10:41:22] (03CR) 10Muehlenhoff: [C: 03+1] "Verified the signature on the key, I'll merge to readd Daniel's access." [puppet] - 10https://gerrit.wikimedia.org/r/520157 (https://phabricator.wikimedia.org/T227052) (owner: 10Dzahn) [10:41:38] (03PS2) 10Muehlenhoff: admins: restore shell access for myself, dzahn [puppet] - 10https://gerrit.wikimedia.org/r/520157 (https://phabricator.wikimedia.org/T227052) (owner: 10Dzahn) [10:45:07] !log Rollout Wikidiff 1.8.2 to codfw - T223391 [10:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:12] T223391: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 [10:47:43] !log Rollout Wikidiff 1.8.2 to eqiad - T223391 [10:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:39] (03PS1) 10Jbond: labs/private: update value [labs/private] - 10https://gerrit.wikimedia.org/r/520209 [10:50:03] jouncebot next [10:50:03] In 0 hour(s) and 9 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190702T1100) [10:50:09] jouncebot now [10:50:10] No deployments scheduled for the next 
0 hour(s) and 9 minute(s) [10:50:21] (03CR) 10Muehlenhoff: [C: 03+2] admins: restore shell access for myself, dzahn [puppet] - 10https://gerrit.wikimedia.org/r/520157 (https://phabricator.wikimedia.org/T227052) (owner: 10Dzahn) [10:54:51] (03CR) 10Jcrespo: "> Patch Set 11:" [puppet] - 10https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) (owner: 10Jcrespo) [10:59:36] (03PS2) 10Jbond: labs/private: profile::restbase::salt_key: update value [labs/private] - 10https://gerrit.wikimedia.org/r/520209 [10:59:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] labs/private: profile::restbase::salt_key: update value [labs/private] - 10https://gerrit.wikimedia.org/r/520209 (owner: 10Jbond) [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190702T1100). [11:00:05] dcausse: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:12] o/ [11:00:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labs/private: profile::restbase::salt_key: update value [labs/private] - 10https://gerrit.wikimedia.org/r/520209 (owner: 10Jbond) [11:00:20] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] labs/private: profile::restbase::salt_key: update value [labs/private] - 10https://gerrit.wikimedia.org/r/520209 (owner: 10Jbond) [11:00:27] I can SWAT [11:00:32] I can’t SWAT today, meeting [11:00:35] sorry [11:01:31] (03CR) 10DCausse: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520176 (owner: 10DCausse) [11:02:01] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: restore shell access for dzahn - https://phabricator.wikimedia.org/T227052 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff I've verified the signed key and merged the key [11:02:36] (03Merged) 10jenkins-bot: [cirrus] Enable UTR30 as a lookup method for ns prefixes on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520176 (owner: 10DCausse) [11:02:53] (03CR) 10jenkins-bot: [cirrus] Enable UTR30 as a lookup method for ns prefixes on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520176 (owner: 10DCausse) [11:04:46] (03CR) 10Alaa Sarhan: [C: 03+1] dumpwikidatajson: Fix error code detection [puppet] - 10https://gerrit.wikimedia.org/r/519494 (https://phabricator.wikimedia.org/T226601) (owner: 10Hoo man) [11:06:08] !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [cirrus] Enable UTR30 as a lookup method for ns prefixes on group1 (duration: 00m 50s) [11:06:08] 10Operations, 10Wikimedia-Mailing-lists: Create MoveCom mailing list for Movement communications group - https://phabricator.wikimedia.org/T218367 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff I'm closing this task, @Varnent please reopen if anything else is needed. 
[11:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:36] dcausse: ping me when swat is done please, so I can trigger php-fpm restarts [11:07:41] thanx :) [11:07:47] jijiki: sure! [11:08:08] jijiki: is it long because I'm just waiting for CI on an extension so if it takes ~10mins please go ahead [11:08:14] no worries [11:08:25] ok [11:08:27] doing it right beats doing it fast :p [11:08:50] so I'd rather wait [11:08:56] sure! [11:09:39] 10Operations, 10ORES, 10Scoring-platform-team, 10vm-requests: New node request: oresrdb[12]003 - https://phabricator.wikimedia.org/T210582 (10MoritzMuehlenhoff) 05Open→03Stalled p:05High→03Normal [11:15:44] !log dcausse@deploy1001 Synchronized php-1.34.0-wmf.11/extensions/CirrusSearch/includes/Updater.php: T226592: Ignore broken redirects when updating incoming link counts (duration: 00m 49s) [11:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:50] T226592: Fatal error from CirrusSearch/IncomingLinkCount Job: Argument to WikiPage::__construct must be Title - https://phabricator.wikimedia.org/T226592 [11:16:20] !log EU Swat done [11:16:23] jijiki: ^ [11:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:27] tx :D [11:19:09] PROBLEM - very high load average likely xfs on ms-be1021 is CRITICAL: CRITICAL - load average: 159.91, 103.16, 53.22 https://wikitech.wikimedia.org/wiki/Swift [11:19:35] PROBLEM - MD RAID on ms-be1021 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [11:19:36] ACKNOWLEDGEMENT - MD RAID on ms-be1021 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T227076 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [11:19:40] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1021 - https://phabricator.wikimedia.org/T227076 (10ops-monitoring-bot) [11:22:03] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1021 - https://phabricator.wikimedia.org/T227076 (10jijiki) p:05Triage→03High [11:22:41] PROBLEM - Check systemd state on ms-be1021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:22:59] PROBLEM - Disk space on ms-be1021 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdg1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [11:23:36] !log temporarily disabled meta monitoring for icinga2001 [11:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:19] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [11:25:19] !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [11:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:04] !log Run restart-php-fpm in all-mw-eqiad - T223391 [11:26:08] !log rebooting icinga2001 for kernel security update [11:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:09] T223391: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 [11:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:02] !log re-enabled meta monitoring for icinga2001 [11:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:33] (03PS1) 10Urbanecm: Add new throttle rule for cswiki workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520214 (https://phabricator.wikimedia.org/T225555) [11:33:37] !log Reopen EU SWAT for last-time 
throttle rule [11:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:08] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520214 (https://phabricator.wikimedia.org/T225555) (owner: 10Urbanecm) [11:35:00] (03Merged) 10jenkins-bot: Add new throttle rule for cswiki workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520214 (https://phabricator.wikimedia.org/T225555) (owner: 10Urbanecm) [11:36:40] (03CR) 10jenkins-bot: Add new throttle rule for cswiki workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520214 (https://phabricator.wikimedia.org/T225555) (owner: 10Urbanecm) [11:37:01] !log urbanecm@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[:gerrit:520214|Add new throttle rule for cswiki workshop]] (T225555) (duration: 00m 49s) [11:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:12] T225555: Increase 6-account limit for educational workshop on 2019-07-02 - https://phabricator.wikimedia.org/T225555 [11:37:45] !log Ran mwscript resetAuthenticationThrottle.php --wiki=metawiki --signup --ip 86.49.134.37 for T225555 [11:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:11] !log EU SWAT really done [11:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:25] heh [11:44:39] 10Operations, 10ops-esams, 10Traffic: cp3037 is currently unreachable - https://phabricator.wikimedia.org/T222041 (10ema) >>! In T222041#5295444, @Joe wrote: > Can someone start the decommission process? this host shows up in things like debdeploy runs or cumin runs and that's distracting. 
+1 [11:48:39] (03PS1) 10Ema: cache: reimage cp2022 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520216 (https://phabricator.wikimedia.org/T226637) [11:50:09] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/520216 (https://phabricator.wikimedia.org/T226637) (owner: 10Ema) [11:51:04] !log depool cp2022 and reimage as upload_ats T226637 [11:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:09] T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 [11:51:55] (03CR) 10Ema: [C: 03+2] cache: reimage cp2022 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520216 (https://phabricator.wikimedia.org/T226637) (owner: 10Ema) [11:52:25] PROBLEM - swift-object-updater on ms-be1021 is CRITICAL: connect to address 10.64.48.54 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Swift [11:52:29] PROBLEM - swift-account-server on ms-be1021 is CRITICAL: connect to address 10.64.48.54 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Swift [11:52:31] PROBLEM - dhclient process on ms-be1021 is CRITICAL: connect to address 10.64.48.54 port 5666: Connection refused [11:52:35] PROBLEM - swift-container-server on ms-be1021 is CRITICAL: connect to address 10.64.48.54 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Swift [11:52:39] PROBLEM - configured eth on ms-be1021 is CRITICAL: connect to address 10.64.48.54 port 5666: Connection refused [11:52:43] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1021 is CRITICAL: connect to address 10.64.48.54 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:52:49] PROBLEM - swift-account-reaper on ms-be1021 is CRITICAL: connect to address 10.64.48.54 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Swift [11:52:49] PROBLEM - swift-object-server on 
ms-be1021 is CRITICAL: connect to address 10.64.48.54 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Swift [11:52:59] PROBLEM - swift-container-replicator on ms-be1021 is CRITICAL: connect to address 10.64.48.54 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Swift [11:53:01] looking what's up with ms-be1021 [11:53:11] PROBLEM - swift-object-replicator on ms-be1021 is CRITICAL: connect to address 10.64.48.54 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Swift [11:53:17] ema: broken disk [11:53:22] I tried to ssh [11:53:25] PROBLEM - swift-container-updater on ms-be1021 is CRITICAL: connect to address 10.64.48.54 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Swift [11:53:26] to no wvail [11:53:28] avail* [11:53:29] PROBLEM - puppet last run on ms-be1021 is CRITICAL: connect to address 10.64.48.54 port 5666: Connection refused [11:53:29] PROBLEM - swift-container-auditor on ms-be1021 is CRITICAL: connect to address 10.64.48.54 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Swift [11:53:33] PROBLEM - swift-account-auditor on ms-be1021 is CRITICAL: connect to address 10.64.48.54 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Swift [11:53:39] PROBLEM - swift-account-replicator on ms-be1021 is CRITICAL: connect to address 10.64.48.54 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Swift [11:53:41] PROBLEM - DPKG on ms-be1021 is CRITICAL: connect to address 10.64.48.54 port 5666: Connection refused [11:53:47] PROBLEM - swift-object-auditor on ms-be1021 is CRITICAL: connect to address 10.64.48.54 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Swift [11:53:51] PROBLEM - Check size of conntrack table on ms-be1021 is CRITICAL: connect to address 10.64.48.54 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [11:54:18] I am putting the host in downtime for now [11:54:27] ack thanks 
jijiki [11:54:28] till we figure out what to do [11:55:39] PROBLEM - Check the NTP synchronisation status of timesyncd on ms-be1021 is CRITICAL: connect to address 10.64.48.54 port 5666: Connection refused [11:56:46] I got a wall of these in console: [11:56:46] ** 8 printk messages dropped ** [2843324.810077] sd 0:1:0:12: rejecting I/O to offline device [11:57:43] yeah there was a generated email about its disk [11:58:05] RECOVERY - snapshot of s6 in codfw on db1115 is OK: snapshot for s6 at codfw taken less than 4 days ago and larger than 90 GB: Last one 2019-07-02 10:56:53 from db2097.codfw.wmnet:3316 (488 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [11:58:08] lets wait for godog to get back and give instructions what to do with it [11:59:42] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2022.codfw.wmnet'] ` The log can be found in `... [12:02:41] (03PS1) 10Muehlenhoff: Decommission cp3037 [puppet] - 10https://gerrit.wikimedia.org/r/520218 (https://phabricator.wikimedia.org/T227077) [12:02:49] PROBLEM - Check systemd state on puppetdb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:05:41] thanks jijiki ema, probably a reboot is warranted [12:06:14] ema: are you still on the console? 
[12:06:58] ^^ looking at puppetdb1001 [12:09:04] (03PS1) 10Ladsgroup: labs: Set wmgWikibaseTmpPropertyTermsMigrationStage to MIGRATION_WRITE_NEW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520220 (https://phabricator.wikimedia.org/T225053) [12:10:06] (03PS1) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [debs/coredns] (debian/sid) - 10https://gerrit.wikimedia.org/r/520221 [12:10:22] (03PS2) 10Filippo Giunchedi: (WIP) toil: add rsyslog TLS remedy [puppet] - 10https://gerrit.wikimedia.org/r/520207 (https://phabricator.wikimedia.org/T199406) [12:11:08] (03CR) 10Hashar: "recheck" [debs/coredns] (debian/sid) - 10https://gerrit.wikimedia.org/r/520221 (owner: 10Hashar) [12:12:51] (03CR) 10Daimona Eaytoy: "> Is there a reason some permissions are set in InitialiseSettings vs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475772 (owner: 10Daimona Eaytoy) [12:13:07] 10Operations, 10media-storage: Not possible to server-side upload certain images: "An unknown error occurred in storage backend "local-swift-eqiad"" - https://phabricator.wikimedia.org/T226937 (10fgiunchedi) >>! In T226937#5296388, @Urbanecm wrote: >>>! In T226937#5296202, @Reedy wrote: >> Is there any way to... 
[12:13:19] (03CR) 10Ladsgroup: [C: 03+2] "noop for prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520220 (https://phabricator.wikimedia.org/T225053) (owner: 10Ladsgroup) [12:14:36] (03Merged) 10jenkins-bot: labs: Set wmgWikibaseTmpPropertyTermsMigrationStage to MIGRATION_WRITE_NEW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520220 (https://phabricator.wikimedia.org/T225053) (owner: 10Ladsgroup) [12:14:43] ^ rebased [12:15:30] jijiki: I'm not [12:15:55] RECOVERY - Check systemd state on puppetdb1001 is OK: OK - running: The system is fully operational [12:16:34] (03CR) 10jenkins-bot: labs: Set wmgWikibaseTmpPropertyTermsMigrationStage to MIGRATION_WRITE_NEW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520220 (https://phabricator.wikimedia.org/T225053) (owner: 10Ladsgroup) [12:17:46] (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [debs/coredns] (debian/sid) - 10https://gerrit.wikimedia.org/r/520221 (owner: 10Hashar) [12:18:55] (03PS24) 10Daimona Eaytoy: Update AbuseFilter config to keep the status quo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475772 [12:23:31] (03PS1) 10Effie Mouzeli: role::puppetmaster::puppetdb Add postgres_exporter [puppet] - 10https://gerrit.wikimedia.org/r/520223 [12:23:35] 10Operations, 10Continuous-Integration-Infrastructure: Jessie rsyslog_8.1901.0-1~bpo8+wmf1_amd64.deb package fails to upgrade - https://phabricator.wikimedia.org/T222166 (10fgiunchedi) >>! In T222166#5296941, @MoritzMuehlenhoff wrote: > @fgiunchedi Shall we close this task? the current jessie package is rolled... 
[12:26:49] (03PS25) 10Daimona Eaytoy: Update AbuseFilter config to keep the status quo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475772 [12:26:56] 10Operations, 10observability, 10Goal, 10User-fgiunchedi: TEC6: Upgrade metrics monitoring infrastructure core components (Q3 2018/19 goal) - https://phabricator.wikimedia.org/T213288 (10fgiunchedi) [12:27:12] 10Operations, 10observability, 10Goal, 10User-fgiunchedi: TEC6: Upgrade metrics monitoring infrastructure core components (Q3 2018/19 goal) - https://phabricator.wikimedia.org/T213288 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This was completed, resolving [12:28:24] (03CR) 10Alexandros Kosiaris: [C: 04-1] Give scaffold template configuration options for dev purposes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/519485 (https://phabricator.wikimedia.org/T226660) (owner: 10Jeena Huneidi) [12:29:04] ema: godog I am going to power reset the server if you are ok [12:29:30] sounds good, thanks jijiki ! 
[12:30:22] !log Power cycle ms-be1021 - T227076 [12:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:58] (03PS1) 10Jbond: puppetdb: add ferm rule for prometheus-postgres-exporter [puppet] - 10https://gerrit.wikimedia.org/r/520224 [12:31:02] T227076: Degraded RAID on ms-be1021 - https://phabricator.wikimedia.org/T227076 [12:31:27] (03CR) 10Alexandros Kosiaris: [C: 04-1] "My finally comment would be that since we added more port functionality, we should also amend" [deployment-charts] - 10https://gerrit.wikimedia.org/r/519485 (https://phabricator.wikimedia.org/T226660) (owner: 10Jeena Huneidi) [12:32:16] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/520224 (owner: 10Jbond) [12:33:21] (03PS26) 10Daimona Eaytoy: Update AbuseFilter config to keep the status quo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475772 [12:36:00] (03PS2) 10Effie Mouzeli: role::puppetmaster::puppetdb Add postgres_exporter [puppet] - 10https://gerrit.wikimedia.org/r/520223 [12:36:27] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/520223 (owner: 10Effie Mouzeli) [12:36:59] (03Abandoned) 10Jbond: puppetdb: add ferm rule for prometheus-postgres-exporter [puppet] - 10https://gerrit.wikimedia.org/r/520224 (owner: 10Jbond) [12:37:40] RECOVERY - dhclient process on ms-be1021 is OK: PROCS OK: 0 processes with command name dhclient [12:37:40] RECOVERY - swift-container-server on ms-be1021 is OK: PROCS OK: 21 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server https://wikitech.wikimedia.org/wiki/Swift [12:37:40] RECOVERY - swift-object-replicator on ms-be1021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator https://wikitech.wikimedia.org/wiki/Swift [12:37:40] RECOVERY - Check size of conntrack table on ms-be1021 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [12:37:40] 
RECOVERY - MD RAID on ms-be1021 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [12:37:44] RECOVERY - swift-object-updater on ms-be1021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater https://wikitech.wikimedia.org/wiki/Swift [12:37:53] (03CR) 10Effie Mouzeli: [V: 03+1] "LGTM https://puppet-compiler.wmflabs.org/compiler1002/17191/puppetdb1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/520223 (owner: 10Effie Mouzeli) [12:37:54] RECOVERY - configured eth on ms-be1021 is OK: OK - interfaces up [12:38:04] RECOVERY - Check systemd state on ms-be1021 is OK: OK - running: The system is fully operational [12:38:04] RECOVERY - swift-container-updater on ms-be1021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater https://wikitech.wikimedia.org/wiki/Swift [12:38:06] RECOVERY - swift-account-server on ms-be1021 is OK: PROCS OK: 21 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server https://wikitech.wikimedia.org/wiki/Swift [12:38:10] RECOVERY - swift-container-auditor on ms-be1021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor https://wikitech.wikimedia.org/wiki/Swift [12:38:14] RECOVERY - swift-account-auditor on ms-be1021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor https://wikitech.wikimedia.org/wiki/Swift [12:38:24] RECOVERY - swift-object-auditor on ms-be1021 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor https://wikitech.wikimedia.org/wiki/Swift [12:38:30] RECOVERY - very high load average likely xfs on ms-be1021 is OK: OK - load average: 11.40, 3.65, 1.28 https://wikitech.wikimedia.org/wiki/Swift [12:38:30] RECOVERY - swift-account-reaper on ms-be1021 is OK: PROCS OK: 1 process with regex 
args ^/usr/bin/python /usr/bin/swift-account-reaper https://wikitech.wikimedia.org/wiki/Swift [12:38:30] RECOVERY - swift-object-server on ms-be1021 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server https://wikitech.wikimedia.org/wiki/Swift [12:38:36] RECOVERY - swift-account-replicator on ms-be1021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator https://wikitech.wikimedia.org/wiki/Swift [12:38:44] RECOVERY - swift-container-replicator on ms-be1021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator https://wikitech.wikimedia.org/wiki/Swift [12:39:24] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1021 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:40:58] Hallo. Is there a way to reset a Wikimedia user's password in production if they don't have an email? There is "resetUserEmail.php", but I don't know whether it's possible to run it in production. [12:41:16] RECOVERY - DPKG on ms-be1021 is OK: All packages OK [12:41:18] RECOVERY - puppet last run on ms-be1021 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:41:28] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: more robust swap deletion operations [puppet] - 10https://gerrit.wikimedia.org/r/520230 (https://phabricator.wikimedia.org/T215531) [12:41:33] (03CR) 10Marostegui: "> > Patch Set 11:" [puppet] - 10https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) (owner: 10Jcrespo) [12:43:00] aharoni48: https://wikitech.wikimedia.org/wiki/Password_reset [12:43:10] RECOVERY - Disk space on ms-be1021 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [12:44:34] amir1: thanks! 
[12:46:37] 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 (10fgiunchedi) [12:46:53] (03PS1) 10Bstorm: toolforge: make kubeadm config allow stacked control plane [puppet] - 10https://gerrit.wikimedia.org/r/520231 (https://phabricator.wikimedia.org/T215531) [12:47:28] !log Upgrade db2082 - T227062 [12:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:34] T227062: Failover s8 (wikidatawiki) db primary master db1071 to db1104 (read-only required) - https://phabricator.wikimedia.org/T227062 [12:47:50] (03CR) 10jerkins-bot: [V: 04-1] toolforge: make kubeadm config allow stacked control plane [puppet] - 10https://gerrit.wikimedia.org/r/520231 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [12:48:17] (03CR) 10Alexandros Kosiaris: [C: 03+1] blubberoid: Add policy file [deployment-charts] - 10https://gerrit.wikimedia.org/r/517573 (https://phabricator.wikimedia.org/T215319) (owner: 10Thcipriani) [12:49:04] (03CR) 10Bstorm: [C: 03+1] "Wow. Is it a lot worse to create a new swap-less image? I'm thinking it probably is, but, I had to ask." 
[puppet] - 10https://gerrit.wikimedia.org/r/520230 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [12:49:06] 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 (10fgiunchedi) [12:49:08] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi, 10User-herron: Increase utilization of application logging pipeline (FY2018-2019 Q3 TEC6) - https://phabricator.wikimedia.org/T213157 (10fgiunchedi) [12:49:11] 10Operations, 10Analytics, 10Wikimedia-Logstash, 10observability: Retire udp2log: onboard its producers and consumers to the logging pipeline - https://phabricator.wikimedia.org/T205856 (10fgiunchedi) [12:49:54] (03PS2) 10Arturo Borrero Gonzalez: toolforge: k8s: more robust swap deletion operations [puppet] - 10https://gerrit.wikimedia.org/r/520230 (https://phabricator.wikimedia.org/T215531) [12:50:00] (03CR) 10Effie Mouzeli: [V: 03+1 C: 03+2] role::puppetmaster::puppetdb Add postgres_exporter [puppet] - 10https://gerrit.wikimedia.org/r/520223 (owner: 10Effie Mouzeli) [12:50:12] (03PS3) 10Effie Mouzeli: role::puppetmaster::puppetdb Add postgres_exporter [puppet] - 10https://gerrit.wikimedia.org/r/520223 [12:51:06] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+1] blubberoid: Add policy file [deployment-charts] - 10https://gerrit.wikimedia.org/r/517573 (https://phabricator.wikimedia.org/T215319) (owner: 10Thcipriani) [12:51:08] 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi: TEC6: Logging infrastructure (Q4 2018/19 goal) - https://phabricator.wikimedia.org/T220103 (10fgiunchedi) [12:51:12] 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 (10fgiunchedi) [12:51:32] !log push RPKI classification to AMS - T220669 [12:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:38] T220669: RPKI Validation - 
https://phabricator.wikimedia.org/T220669 [12:51:48] PROBLEM - MariaDB Slave IO: s8 on db2094 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2082.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2082.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [12:51:58] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi, 10User-herron: Increase utilization of application logging pipeline (FY2018-2019 Q3 TEC6) - https://phabricator.wikimedia.org/T213157 (10fgiunchedi) [12:52:00] 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 (10fgiunchedi) [12:52:02] (03PS2) 10Bstorm: toolforge: make kubeadm config allow stacked control plane [puppet] - 10https://gerrit.wikimedia.org/r/520231 (https://phabricator.wikimedia.org/T215531) [12:52:03] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs - https://phabricator.wikimedia.org/T213899 (10fgiunchedi) [12:52:28] (03CR) 10jerkins-bot: [V: 04-1] toolforge: make kubeadm config allow stacked control plane [puppet] - 10https://gerrit.wikimedia.org/r/520231 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [12:53:03] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi, 10User-herron: Increase utilization of application logging pipeline (FY2018-2019 Q3 TEC6) - https://phabricator.wikimedia.org/T213157 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Resolving, related tasks have been reparented. 
[12:53:40] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 (10fgiunchedi) [12:54:20] (03CR) 10Arturo Borrero Gonzalez: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/520230 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [12:54:29] 10Operations, 10vm-requests, 10cloud-services-team (Kanban): Three small ganeti VMs to host haproxy for OpenStack endpoints - https://phabricator.wikimedia.org/T227041 (10akosiaris) Sounds fine to me. Please use `row_A` in eqiad for this as it has more resources available. Also, I guess all three VMs will ha... [12:54:31] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 (10fgiunchedi) [12:54:34] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Migrate >=90% of existing Logstash traffic to the logging pipeline - https://phabricator.wikimedia.org/T205851 (10fgiunchedi) [12:54:36] 10Operations, 10Wikimedia-Logstash, 10service-runner, 10Core Platform Team Kanban (Done with CPT), and 2 others: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10fgiunchedi) [12:54:55] (03PS3) 10Bstorm: toolforge: make kubeadm config allow stacked control plane [puppet] - 10https://gerrit.wikimedia.org/r/520231 (https://phabricator.wikimedia.org/T215531) [12:55:19] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2022.codfw.wmnet'] ` and were **ALL** successful. 
[12:55:25] (03PS3) 10Arturo Borrero Gonzalez: toolforge: k8s: more robust swap deletion operations [puppet] - 10https://gerrit.wikimedia.org/r/520230 (https://phabricator.wikimedia.org/T215531) [12:56:10] RECOVERY - Check the NTP synchronisation status of timesyncd on ms-be1021 is OK: OK: synced at Tue 2019-07-02 12:56:08 UTC. [12:56:17] 10Operations, 10Analytics, 10Traffic: Size of headers processed by varnish? - https://phabricator.wikimedia.org/T198152 (10ema) >>! In T198152#4324161, @ema wrote: > Both [[ https://varnish-cache.org/docs/5.1/reference/varnishd.html#http-req-hdr-len | varnish ]] and [[http://nginx.org/en/docs/http/ngx_http_c... [12:56:24] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: more robust swap deletion operations [puppet] - 10https://gerrit.wikimedia.org/r/520230 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [12:56:28] (03PS4) 10Bstorm: toolforge: make kubeadm config allow stacked control plane [puppet] - 10https://gerrit.wikimedia.org/r/520231 (https://phabricator.wikimedia.org/T215531) [12:58:37] (03PS4) 10Filippo Giunchedi: rsyslog: use named actions for central syslog hosts [puppet] - 10https://gerrit.wikimedia.org/r/520012 (https://phabricator.wikimedia.org/T226703) [12:59:04] RECOVERY - MariaDB Slave IO: s8 on db2094 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [13:02:57] (03CR) 10Herron: [C: 03+1] "LGTM! 
Scope-wise it probably justifies temporarily disabling puppet and deploying to a subset of "canary" hosts before deploying across t" [puppet] - 10https://gerrit.wikimedia.org/r/520012 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [13:03:51] (03PS2) 10ArielGlenn: Wikidata dumps: Update minimum expected sizes [puppet] - 10https://gerrit.wikimedia.org/r/519493 (https://phabricator.wikimedia.org/T226601) (owner: 10Hoo man) [13:04:54] (03CR) 10ArielGlenn: [C: 03+2] Wikidata dumps: Update minimum expected sizes [puppet] - 10https://gerrit.wikimedia.org/r/519493 (https://phabricator.wikimedia.org/T226601) (owner: 10Hoo man) [13:05:47] (03PS2) 10ArielGlenn: dumpwikidatajson: Fix error code detection [puppet] - 10https://gerrit.wikimedia.org/r/519494 (https://phabricator.wikimedia.org/T226601) (owner: 10Hoo man) [13:06:37] (03PS3) 10Jcrespo: WMFReplication: Make move work for a limited number of cases [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/517794 [13:06:53] !log pool cp2022 w/ ATS backend T226637 [13:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:58] T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 [13:07:02] (03CR) 10jerkins-bot: [V: 04-1] WMFReplication: Make move work for a limited number of cases [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/517794 (owner: 10Jcrespo) [13:07:15] (03CR) 10ArielGlenn: [C: 03+2] dumpwikidatajson: Fix error code detection [puppet] - 10https://gerrit.wikimedia.org/r/519494 (https://phabricator.wikimedia.org/T226601) (owner: 10Hoo man) [13:07:32] (03CR) 10Jcrespo: "This change is ready for review." 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/517794 (owner: 10Jcrespo) [13:09:00] 10Operations, 10MediaWiki-Debug-Logger, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: MediaWiki logging & encryption - https://phabricator.wikimedia.org/T126989 (10fgiunchedi) I believe with {T183303} now resolved all mw logs should be encrypted except for udp2log! [13:09:45] !log push RPKI classification to eqsin - T220669 [13:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:52] T220669: RPKI Validation - https://phabricator.wikimedia.org/T220669 [13:11:28] 10Operations, 10LDAP-Access-Requests: Grant WMDE engineers access to logstash and creating grafana boards / Add WMDE engineers to 'nda' LDAP group - https://phabricator.wikimedia.org/T225004 (10WMDE-leszek) That sounds good @MoritzMuehlenhoff, thanks! [13:13:22] !log push RPKI classification to eqiad - T220669 [13:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:48] (03PS1) 10Bstorm: toolforge: add calico CRD config [puppet] - 10https://gerrit.wikimedia.org/r/520233 (https://phabricator.wikimedia.org/T215531) [13:15:04] (03PS5) 10Filippo Giunchedi: rsyslog: use named actions for central syslog hosts [puppet] - 10https://gerrit.wikimedia.org/r/520012 (https://phabricator.wikimedia.org/T226703) [13:15:40] (03PS1) 10Awight: Enable experimental FileImporter features on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520234 (https://phabricator.wikimedia.org/T225617) [13:17:30] (03CR) 10Jbond: "I have now updated the private repo[1] and PCC is running with still no real changes, the only changes listed are re-ordering a hash. 
As " [puppet] - 10https://gerrit.wikimedia.org/r/511686 (owner: 10Jbond) [13:18:10] (03CR) 10Filippo Giunchedi: "> Patch Set 4: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/520012 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [13:19:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] toolforge: make kubeadm config allow stacked control plane [puppet] - 10https://gerrit.wikimedia.org/r/520231 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [13:19:45] 10Operations, 10Analytics, 10Wikimedia-Incident: Move icinga alarm for the EventStreams external endpoint to SRE - https://phabricator.wikimedia.org/T227065 (10Ottomata) +1 I think this alarm should alert SRE. [13:20:28] (03PS1) 10Ema: Revert "Increase client header buffer size, to allow for larger sets of cookies." [puppet] - 10https://gerrit.wikimedia.org/r/520235 (https://phabricator.wikimedia.org/T198152) [13:22:02] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 (10fgiunchedi) [13:23:24] !log Upgrade db2085 [13:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:52] (03PS1) 10Ema: cache: reimage cp2024 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520236 (https://phabricator.wikimedia.org/T226637) [13:26:41] !log test fix policy ASXXX_in (missing `then next policy`) [13:26:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:12] !log depool cp2024 and reimage as upload_ats T226637 [13:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:17] T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 [13:27:31] (03CR) 10Ema: [C: 03+2] cache: reimage cp2024 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520236 (https://phabricator.wikimedia.org/T226637) (owner: 10Ema) [13:30:28] (03PS1) 
10Arturo Borrero Gonzalez: toolforge: k8s: add calico yaml config [puppet] - 10https://gerrit.wikimedia.org/r/520238 (https://phabricator.wikimedia.org/T215531) [13:31:55] !log Upgrade db2086 [13:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:58] (03CR) 10WMDE-Fisch: [C: 03+1] Enable experimental FileImporter features on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520234 (https://phabricator.wikimedia.org/T225617) (owner: 10Awight) [13:33:11] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2024.codfw.wmnet'] ` The log can be found in `... [13:33:12] PROBLEM - puppet last run on wtp1026 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
[13:33:43] (03CR) 10Ema: [C: 03+1] Decommission cp3037 [puppet] - 10https://gerrit.wikimedia.org/r/520218 (https://phabricator.wikimedia.org/T227077) (owner: 10Muehlenhoff) [13:37:50] RECOVERY - Disk space on notebook1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [13:38:46] (03PS5) 10Bstorm: toolforge: make kubeadm config allow stacked control plane [puppet] - 10https://gerrit.wikimedia.org/r/520231 (https://phabricator.wikimedia.org/T215531) [13:40:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] toolforge: add calico CRD config [puppet] - 10https://gerrit.wikimedia.org/r/520233 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [13:40:52] (03CR) 10Bstorm: [C: 03+2] toolforge: make kubeadm config allow stacked control plane [puppet] - 10https://gerrit.wikimedia.org/r/520231 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [13:43:13] 10Operations, 10MediaWiki-Debug-Logger, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: MediaWiki logging & encryption - https://phabricator.wikimedia.org/T126989 (10Ottomata) Hm, I don't think T183303 affected any encryption status of logs. The Avro logs we migrated to event gate just do an... [13:48:36] (03CR) 10BBlack: [C: 03+1] Revert "Increase client header buffer size, to allow for larger sets of cookies." 
[puppet] - 10https://gerrit.wikimedia.org/r/520235 (https://phabricator.wikimedia.org/T198152) (owner: 10Ema) [13:49:23] (03CR) 10BBlack: [C: 03+1] lvs: alert on excessive network rx traffic [puppet] - 10https://gerrit.wikimedia.org/r/520160 (owner: 10CDanis) [13:52:29] (03PS5) 10CDanis: lvs: alert on excessive network rx traffic [puppet] - 10https://gerrit.wikimedia.org/r/520160 [13:52:35] (03CR) 10CDanis: [C: 03+2] lvs: alert on excessive network rx traffic [puppet] - 10https://gerrit.wikimedia.org/r/520160 (owner: 10CDanis) [13:56:03] RECOVERY - snapshot of s3 in codfw on db1115 is OK: snapshot for s3 at codfw taken less than 4 days ago and larger than 90 GB: Last one 2019-07-02 10:06:59 from db2098.codfw.wmnet:3313 (746 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [13:57:07] (03PS30) 10BBlack: Bird anycast: add anycast_healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/397723 (owner: 10Ayounsi) [13:59:00] 10Operations, 10MediaWiki-Debug-Logger, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: MediaWiki logging & encryption - https://phabricator.wikimedia.org/T126989 (10fgiunchedi) >>! In T126989#5300054, @Ottomata wrote: > Hm, I don't think T183303 affected any encryption status of logs. The Av... [14:00:47] (03PS2) 10Bstorm: toolforge: add calico CRD config [puppet] - 10https://gerrit.wikimedia.org/r/520233 (https://phabricator.wikimedia.org/T215531) [14:02:02] (03CR) 10Bstorm: "The dns domain param wasn't liked by 1.15, so I've removed that from the class." 
[puppet] - 10https://gerrit.wikimedia.org/r/520233 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [14:02:10] !log deploying anycast_healthchecker changes to the recdnses (puppet disabled on all, testing dns4002 first) - https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/397723/ [14:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:28] (03CR) 10Hoo man: refactor wikidata entity dumps into wikibase + wikidata specific bits (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917) (owner: 10ArielGlenn) [14:02:33] (03CR) 10BBlack: [C: 03+2] Bird anycast: add anycast_healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/397723 (owner: 10Ayounsi) [14:07:10] !log otto@deploy1001 Started deploy [eventstreams/deploy@de1d356]: Limit concurrent number of connections per X-Client-IP - T226808 [14:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:15] T226808: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 [14:09:44] *** I will restart apache2 on phab1003 aka phabricator.wikimedia.org at 14:25 *** [14:10:24] (03PS4) 10BPirkle: Add kask session storage configuration. 
Use only on testwiki, [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519432 (https://phabricator.wikimedia.org/T222099) [14:11:05] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs2003 is CRITICAL: (null) https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2003&var-datasource=codfw+prometheus/ops [14:11:39] heh [14:12:12] (03PS4) 10Jcrespo: WMFReplication: Make move work for a limited number of cases [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/517794 [14:12:23] oh wow there it is [14:12:34] (03CR) 10jerkins-bot: [V: 04-1] WMFReplication: Make move work for a limited number of cases [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/517794 (owner: 10Jcrespo) [14:12:37] (03PS1) 10CDanis: excessive-lvs-rx-traffic: fix quoting error [puppet] - 10https://gerrit.wikimedia.org/r/520245 [14:12:43] PROBLEM - puppet last run on cp2024 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[nginx] [14:13:21] the cp2024 alert is due to my reimage, ignore ^ [14:13:27] !log otto@deploy1001 Finished deploy [eventstreams/deploy@de1d356]: Limit concurrent number of connections per X-Client-IP - T226808 (duration: 06m 17s) [14:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:33] T226808: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 [14:13:55] (03PS5) 10Jcrespo: WMFReplication: Make move work for a limited number of cases [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/517794 [14:14:12] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2024.codfw.wmnet'] ` and were **ALL** successful. 
[14:14:19] (03CR) 10jerkins-bot: [V: 04-1] WMFReplication: Make move work for a limited number of cases [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/517794 (owner: 10Jcrespo) [14:15:15] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs2006 is CRITICAL: (null) https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2006&var-datasource=codfw+prometheus/ops [14:15:40] 10Operations, 10Analytics, 10Security, 10Services (watching), 10Wikimedia-Incident: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Ottomata) Hm, I just noticed there are more eventstreams processors than I had thought.... [14:17:51] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Enable DataBridge on Beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519987 (https://phabricator.wikimedia.org/T226816) (owner: 10Matthias Geisler) [14:17:58] RECOVERY - puppet last run on cp2024 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:18:57] !log pool cp2024 w/ ATS backend T226637 [14:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:03] T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 [14:19:42] RECOVERY - puppet last run on wtp1026 is OK: OK: Puppet is currently enabled, last run 20 minutes ago with 0 failures [14:19:42] 10Operations, 10Performance-Team, 10TechCom-RFC, 10Traffic, and 4 others: Serve Main Page of WMF wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Krinkle) @Ladsgroup "[Under discussion](https://www.mediawiki.org/wiki/Requests_for_comment/Process#Review_process)" here merely means... [14:19:46] (03PS2) 10Ema: Revert "Increase client header buffer size, to allow for larger sets of cookies." 
[puppet] - 10https://gerrit.wikimedia.org/r/520235 (https://phabricator.wikimedia.org/T198152) [14:20:18] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs2004 is CRITICAL: (null) https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2004&var-datasource=codfw+prometheus/ops [14:21:50] (03CR) 10Ema: [C: 03+2] Revert "Increase client header buffer size, to allow for larger sets of cookies." [puppet] - 10https://gerrit.wikimedia.org/r/520235 (https://phabricator.wikimedia.org/T198152) (owner: 10Ema) [14:22:02] !log add DNS anycast BGP statement to cr3-ulsfo [14:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:06] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs3003 is CRITICAL: (null) https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3003&var-datasource=esams+prometheus/ops [14:22:30] (03PS2) 10CDanis: excessive-lvs-rx-traffic: fix quoting error [puppet] - 10https://gerrit.wikimedia.org/r/520245 [14:23:06] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs5003 is CRITICAL: (null) https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs5003&var-datasource=eqsin+prometheus/ops [14:23:50] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs4006 is CRITICAL: (null) https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs4006&var-datasource=ulsfo+prometheus/ops [14:25:29] !log restart apache2 on phab1003 [14:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:18] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs1014 is CRITICAL: (null) https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs1014&var-datasource=eqiad+prometheus/ops [14:29:26] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs2005 is CRITICAL: (null) 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2005&var-datasource=codfw+prometheus/ops [14:29:26] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs2001 is CRITICAL: (null) https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2001&var-datasource=codfw+prometheus/ops [14:30:10] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs3001 is CRITICAL: (null) https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3001&var-datasource=esams+prometheus/ops [14:30:23] ah. ic [14:31:43] (03CR) 10CDanis: [C: 03+2] excessive-lvs-rx-traffic: fix quoting error [puppet] - 10https://gerrit.wikimedia.org/r/520245 (owner: 10CDanis) [14:31:57] (03PS1) 10Elukey: profile::eventstreams: add X-Client-IP to the health check [puppet] - 10https://gerrit.wikimedia.org/r/520247 (https://phabricator.wikimedia.org/T226808) [14:32:18] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs3004 is CRITICAL: (null) https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3004&var-datasource=esams+prometheus/ops [14:32:38] (03PS2) 10Elukey: profile::eventstreams: add X-Client-IP to the health check [puppet] - 10https://gerrit.wikimedia.org/r/520247 (https://phabricator.wikimedia.org/T226808) [14:33:53] (03CR) 10ArielGlenn: refactor wikidata entity dumps into wikibase + wikidata specific bits (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917) (owner: 10ArielGlenn) [14:34:59] (03PS3) 10Ottomata: profile::eventstreams: add X-Client-IP to the health check [puppet] - 10https://gerrit.wikimedia.org/r/520247 (https://phabricator.wikimedia.org/T226808) (owner: 10Elukey) [14:36:01] (03PS4) 10Ottomata: profile::eventstreams: add X-Client-IP to the health check [puppet] - 10https://gerrit.wikimedia.org/r/520247 (https://phabricator.wikimedia.org/T226808) (owner: 10Elukey) [14:36:52] (03CR) 10Ottomata: [C: 03+2] profile::eventstreams: add 
X-Client-IP to the health check [puppet] - 10https://gerrit.wikimedia.org/r/520247 (https://phabricator.wikimedia.org/T226808) (owner: 10Elukey) [14:37:09] 10Operations, 10MobileFrontend, 10TechCom-RFC, 10Traffic, 10Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (10Krinkle) >>! In T214998#5298484, @DIKW_Pyramid wrote: > If I use desktop browser - how I... [14:37:34] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs2002 is CRITICAL: (null) https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2002&var-datasource=codfw+prometheus/ops [14:38:01] (03PS3) 10Bstorm: toolforge: add calico CRD config [puppet] - 10https://gerrit.wikimedia.org/r/520233 (https://phabricator.wikimedia.org/T215531) [14:39:54] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [14:40:26] 10Operations, 10ops-codfw: codfw humidity too high - https://phabricator.wikimedia.org/T225137 (10Papaul) a:05Papaul→03RobH @RobH i checked this morning the graph from Librenms, it looks like the humidity is back to normal. Please double check and see and close the task. We can reopen it anytime. Thanks. 
[14:41:42] RECOVERY - Excessive RX traffic on an LVS -units megabits/sec- on lvs2001 is OK: (C)3200 ge (W)1600 ge 22.87 https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2001&var-datasource=codfw+prometheus/ops [14:41:49] apologies for the Excessive RX traffic alerts [14:42:40] RECOVERY - Excessive RX traffic on an LVS -units megabits/sec- on lvs1014 is OK: (C)3200 ge (W)1600 ge 195.2 https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs1014&var-datasource=eqiad+prometheus/ops [14:42:40] RECOVERY - Excessive RX traffic on an LVS -units megabits/sec- on lvs2004 is OK: (C)3200 ge (W)1600 ge 0.155 https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2004&var-datasource=codfw+prometheus/ops [14:42:52] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs5002 is CRITICAL: (null) https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs5002&var-datasource=eqsin+prometheus/ops [14:42:52] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs1016 is CRITICAL: (null) https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs1016&var-datasource=eqiad+prometheus/ops [14:44:14] 10Operations, 10MediaWiki-Maintenance-scripts, 10Patch-For-Review, 10Performance-Team (Radar): cron spam for slow queries on mwmaint /usr/local/bin/foreachwiki initSiteStats.php --update > /dev/null - https://phabricator.wikimedia.org/T216243 (10Krinkle) [14:44:20] 10Operations, 10Patch-For-Review: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (10Krinkle) [14:44:23] 10Operations, 10MediaWiki-Maintenance-scripts, 10Patch-For-Review, 10Performance-Team (Radar): cron spam for slow queries on mwmaint /usr/local/bin/foreachwiki initSiteStats.php --update > /dev/null - https://phabricator.wikimedia.org/T216243 (10Krinkle) 05Open→03Declined [14:47:00] !log add anycast BGP statement to eqsin [14:47:04] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on 
lvs5001 is CRITICAL: (null) https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs5001&var-datasource=eqsin+prometheus/ops [14:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:08] RECOVERY - Excessive RX traffic on an LVS -units megabits/sec- on lvs2002 is OK: (C)3200 ge (W)1600 ge 31.54 https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2002&var-datasource=codfw+prometheus/ops [14:47:08] RECOVERY - Excessive RX traffic on an LVS -units megabits/sec- on lvs2005 is OK: (C)3200 ge (W)1600 ge 0.209 https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2005&var-datasource=codfw+prometheus/ops [14:47:16] RECOVERY - Excessive RX traffic on an LVS -units megabits/sec- on lvs1016 is OK: (C)3200 ge (W)1600 ge 14.77 https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs1016&var-datasource=eqiad+prometheus/ops [14:47:16] RECOVERY - Excessive RX traffic on an LVS -units megabits/sec- on lvs5002 is OK: (C)3200 ge (W)1600 ge 110.4 https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs5002&var-datasource=eqsin+prometheus/ops [14:48:12] (03PS1) 10Ema: cache: reimage cp2025 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520250 (https://phabricator.wikimedia.org/T226637) [14:48:36] (03CR) 10Bstorm: [C: 03+2] toolforge: add calico CRD config [puppet] - 10https://gerrit.wikimedia.org/r/520233 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [14:49:21] RECOVERY - Excessive RX traffic on an LVS -units megabits/sec- on lvs5001 is OK: (C)3200 ge (W)1600 ge 159.1 https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs5001&var-datasource=eqsin+prometheus/ops [14:49:21] !log depool cp2025 and reimage as upload_ats T226637 [14:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:26] T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 
[14:49:46] (03PS2) 10Ema: cache: reimage cp2025 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520250 (https://phabricator.wikimedia.org/T226637) [14:50:55] (03CR) 10Ema: [C: 03+2] cache: reimage cp2025 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520250 (https://phabricator.wikimedia.org/T226637) (owner: 10Ema) [14:52:29] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs3002 is CRITICAL: (null) https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3002&var-datasource=esams+prometheus/ops [14:54:19] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs1015 is CRITICAL: (null) https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs1015&var-datasource=eqiad+prometheus/ops [14:54:58] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 46.67% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:55:06] RECOVERY - Excessive RX traffic on an LVS -units megabits/sec- on lvs3001 is OK: (C)3200 ge (W)1600 ge 317.9 https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3001&var-datasource=esams+prometheus/ops [14:55:36] RECOVERY - Excessive RX traffic on an LVS -units megabits/sec- on lvs1015 is OK: (C)3200 ge (W)1600 ge 562.2 https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs1015&var-datasource=eqiad+prometheus/ops [14:55:44] RECOVERY - Excessive RX traffic on an LVS -units megabits/sec- on lvs3004 is OK: (C)3200 ge (W)1600 ge 0.1654 https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3004&var-datasource=esams+prometheus/ops [14:56:02] RECOVERY - Excessive RX traffic on an LVS -units megabits/sec- on lvs5003 is OK: (C)3200 ge (W)1600 ge 0.1635 https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs5003&var-datasource=eqsin+prometheus/ops [14:56:28] 10Operations, 10LDAP-Access-Requests: Grant WMDE 
engineers access to logstash and creating grafana boards / Add WMDE engineers to 'nda' LDAP group - https://phabricator.wikimedia.org/T225004 (10MoritzMuehlenhoff) All the accounts listed in this task have been added to cn=nda, please let me know if there are a... [14:56:59] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2025.codfw.wmnet'] ` The log can be found in `... [14:57:38] RECOVERY - Excessive RX traffic on an LVS -units megabits/sec- on lvs2003 is OK: (C)3200 ge (W)1600 ge 198.6 https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2003&var-datasource=codfw+prometheus/ops [14:58:08] RECOVERY - Excessive RX traffic on an LVS -units megabits/sec- on lvs2006 is OK: (C)3200 ge (W)1600 ge 17.36 https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2006&var-datasource=codfw+prometheus/ops [14:58:41] !log Run restart-php-fpm in all-mw-codfw - T223391 [14:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:47] T223391: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 [15:01:20] (03PS1) 10Krinkle: readme: Add link to new location [puppet/cdh] - 10https://gerrit.wikimedia.org/r/520254 (https://phabricator.wikimedia.org/T226474) [15:01:34] (03CR) 10jerkins-bot: [V: 04-1] readme: Add link to new location [puppet/cdh] - 10https://gerrit.wikimedia.org/r/520254 (https://phabricator.wikimedia.org/T226474) (owner: 10Krinkle) [15:01:49] (03CR) 10Krinkle: "I've temporary set the repo back from read-only to active to allow review/landing of this change." 
[puppet/cdh] - 10https://gerrit.wikimedia.org/r/520254 (https://phabricator.wikimedia.org/T226474) (owner: 10Krinkle) [15:02:09] 10Operations, 10WMDE-QWERTY-Team, 10serviceops, 10wikidiff2, and 3 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (10jijiki) 05Open→03Resolved All service restarts should be complete in a few hours, please reopen if there are any issues :) [15:04:56] RECOVERY - Excessive RX traffic on an LVS -units megabits/sec- on lvs4006 is OK: (C)3200 ge (W)1600 ge 23.93 https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs4006&var-datasource=ulsfo+prometheus/ops [15:05:33] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (10Papaul) [15:06:29] 10Operations, 10Release-Engineering-Team, 10Release-Engineering-Team-TODO, 10SRE-Access-Requests: Request access to deployment cluster for Alaa Sarhan - https://phabricator.wikimedia.org/T223698 (10greg) >>! In T223698#5281575, @jbond wrote: > @greg are you able to approve this request? Approved. (sorry f... [15:06:56] (03PS1) 10Andrew Bogott: Install/partman for cloudbackup2001/2002 [puppet] - 10https://gerrit.wikimedia.org/r/520257 (https://phabricator.wikimedia.org/T224528) [15:07:22] RECOVERY - Excessive RX traffic on an LVS -units megabits/sec- on lvs3003 is OK: (C)3200 ge (W)1600 ge 0.07974 https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3003&var-datasource=esams+prometheus/ops [15:10:03] (03CR) 10Krinkle: [V: 03+2] readme: Add link to new location [puppet/cdh] - 10https://gerrit.wikimedia.org/r/520254 (https://phabricator.wikimedia.org/T226474) (owner: 10Krinkle) [15:13:46] PROBLEM - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [15:17:16] RECOVERY - Excessive RX traffic on an LVS -units megabits/sec- on lvs3002 is OK: (C)3200 ge (W)1600 ge 260.7 https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3002&var-datasource=esams+prometheus/ops [15:18:42] (03PS1) 10Muehlenhoff: Add two WMDE users to LDAP users list [puppet] - 10https://gerrit.wikimedia.org/r/520259 (https://phabricator.wikimedia.org/T225004) [15:18:55] (03CR) 10Andrew Bogott: [C: 03+2] Install/partman for cloudbackup2001/2002 [puppet] - 10https://gerrit.wikimedia.org/r/520257 (https://phabricator.wikimedia.org/T224528) (owner: 10Andrew Bogott) [15:19:10] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [15:19:17] (03CR) 10Elukey: [C: 03+2] readme: Add link to new location [puppet/cdh] - 10https://gerrit.wikimedia.org/r/520254 (https://phabricator.wikimedia.org/T226474) (owner: 10Krinkle) [15:19:23] (03CR) 10jerkins-bot: [V: 04-1] readme: Add link to new location [puppet/cdh] - 10https://gerrit.wikimedia.org/r/520254 (https://phabricator.wikimedia.org/T226474) (owner: 10Krinkle) [15:19:30] (03CR) 10jerkins-bot: [V: 04-1] Add two WMDE users to LDAP users list [puppet] - 10https://gerrit.wikimedia.org/r/520259 (https://phabricator.wikimedia.org/T225004) (owner: 10Muehlenhoff) [15:19:31] (03CR) 10Elukey: [C: 03+2] "Thanks a lot TImo! Is it ready to merge?" 
[puppet/cdh] - 10https://gerrit.wikimedia.org/r/520254 (https://phabricator.wikimedia.org/T226474) (owner: 10Krinkle) [15:19:38] (03CR) 10jerkins-bot: [V: 04-1] readme: Add link to new location [puppet/cdh] - 10https://gerrit.wikimedia.org/r/520254 (https://phabricator.wikimedia.org/T226474) (owner: 10Krinkle) [15:20:19] !log Set repo back from active to read-only https://gerrit.wikimedia.org/r/#/admin/projects/operations/puppet/cdh (T226474)) [15:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:26] T226474: Archive cdh puppet submodule - https://phabricator.wikimedia.org/T226474 [15:20:40] (03PS1) 10Andrew Bogott: livirt: sort tls_allowed_dn_list [puppet] - 10https://gerrit.wikimedia.org/r/520261 (https://phabricator.wikimedia.org/T227060) [15:20:42] (03PS1) 10Andrew Bogott: nova-fullstack monitoring: allow for more leaks before we alert [puppet] - 10https://gerrit.wikimedia.org/r/520262 (https://phabricator.wikimedia.org/T227060) [15:21:09] (03PS2) 10Muehlenhoff: Add two WMDE users to LDAP users list [puppet] - 10https://gerrit.wikimedia.org/r/520259 (https://phabricator.wikimedia.org/T225004) [15:22:07] 10Operations, 10ops-codfw: codfw humidity too high - https://phabricator.wikimedia.org/T225137 (10ayounsi) LGTM, I re-enabled all the checks from T225137#5242532 [15:23:09] (03PS3) 10Muehlenhoff: Add two WMDE users to LDAP users list [puppet] - 10https://gerrit.wikimedia.org/r/520259 (https://phabricator.wikimedia.org/T225004) [15:23:36] (03CR) 10Andrew Bogott: "puppet compiler seems encouraging" [puppet] - 10https://gerrit.wikimedia.org/r/520261 (https://phabricator.wikimedia.org/T227060) (owner: 10Andrew Bogott) [15:24:33] 10Operations, 10Analytics, 10Analytics-Kanban, 10Cleanup, 10Patch-For-Review: Archive cdh puppet submodule - https://phabricator.wikimedia.org/T226474 (10Krinkle) On the GitHub mirror, I've closed outstanding pull requests on the mirror, and set the the "Archived" (read-only) flag on. 
This means it is st... [15:24:52] 10Operations, 10Analytics, 10Analytics-Kanban, 10Cleanup, 10Patch-For-Review: Archive cdh puppet submodule - https://phabricator.wikimedia.org/T226474 (10Krinkle) [15:25:21] (03CR) 10Muehlenhoff: [C: 03+2] Add two WMDE users to LDAP users list [puppet] - 10https://gerrit.wikimedia.org/r/520259 (https://phabricator.wikimedia.org/T225004) (owner: 10Muehlenhoff) [15:26:08] !log add centrallog1001 to routers ACLs - T226813 [15:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:14] T226813: Add centrallog1001 to syslog servers in network ACLs - https://phabricator.wikimedia.org/T226813 [15:28:30] 10Operations, 10observability: monitoring::check_prometheus should error on an unquoted ! in the query - https://phabricator.wikimedia.org/T227100 (10CDanis) [15:28:30] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [15:28:41] 10Operations, 10observability, 10good first bug: monitoring::check_prometheus should error on an unquoted ! in the query - https://phabricator.wikimedia.org/T227100 (10CDanis) p:05Triage→03Normal [15:28:55] 10Operations, 10vm-requests, 10cloud-services-team (Kanban): Three small ganeti VMs to host haproxy for OpenStack endpoints - https://phabricator.wikimedia.org/T227041 (10Andrew) >>! In T227041#5299907, @akosiaris wrote: > Sounds fine to me. Please use `row_A` in eqiad for this as it has more resources avail... [15:30:06] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi, 10User-herron: rack/setup/install centrallog1001.eqiad.wmnet - https://phabricator.wikimedia.org/T200706 (10ayounsi) [15:30:08] 10Operations, 10netops, 10User-fgiunchedi: Add centrallog1001 to syslog servers in network ACLs - https://phabricator.wikimedia.org/T226813 (10ayounsi) 05Open→03Resolved a:03ayounsi Done. 
Only needed in analytics and old labs filters. [15:31:14] (03CR) 10Lucas Werkmeister (WMDE): Specify $wgWBRepoSettings['conceptBaseUri'] again (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518239 (https://phabricator.wikimedia.org/T225212) (owner: 10Lucas Werkmeister (WMDE)) [15:32:29] 10Operations, 10Analytics, 10Security, 10Services (watching), 10Wikimedia-Incident: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Ottomata) Ah still wrong. Full details. The varnish limit is 25 per varnish-backend inst... [15:34:20] !log Add BGP to AS15830 in AMS-IX [15:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:36] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2025.codfw.wmnet'] ` and were **ALL** successful. [15:42:02] (03CR) 10Jhedden: [C: 03+2] livirt: sort tls_allowed_dn_list [puppet] - 10https://gerrit.wikimedia.org/r/520261 (https://phabricator.wikimedia.org/T227060) (owner: 10Andrew Bogott) [15:43:00] 10Operations, 10observability, 10Patch-For-Review: Icinga custom checks should follow our HTTP User-Agent policy - https://phabricator.wikimedia.org/T226508 (10jbond) If anyone can review ttps://gerrit.wikimedia.org/r/c/operations/puppet/+/519227 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/5192... 
[15:43:04] (03CR) 10Cwhite: [C: 03+1] "LGTM (untested)" [puppet] - 10https://gerrit.wikimedia.org/r/520012 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [15:43:36] (03CR) 10Jhedden: [C: 03+2] nova-fullstack monitoring: allow for more leaks before we alert [puppet] - 10https://gerrit.wikimedia.org/r/520262 (https://phabricator.wikimedia.org/T227060) (owner: 10Andrew Bogott) [15:43:45] !log "Equinix will be expanding the DA IX subnet from a /24 to a /23." (cf. email) [15:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:56] PROBLEM - Confd template for /etc/varnish/directors.backend.vcl on cp2025 is CRITICAL: NRPE: Command check_confd_etc_varnish_directors.backend.vcl not defined https://wikitech.wikimedia.org/wiki/Confd [15:45:04] PROBLEM - IPsec on cp2025 is CRITICAL: NRPE: Command check_IPsec not defined https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:45:34] PROBLEM - Varnish traffic logger - varnishstatsd on cp2025 is CRITICAL: NRPE: Command check_varnishstatsd not defined https://wikitech.wikimedia.org/wiki/Varnish [15:46:01] (03PS2) 10Andrew Bogott: livirt: sort tls_allowed_dn_list [puppet] - 10https://gerrit.wikimedia.org/r/520261 (https://phabricator.wikimedia.org/T227060) [15:46:28] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/519227 (https://phabricator.wikimedia.org/T226508) (owner: 10Jbond) [15:47:46] !log pool cp2025 w/ ATS backend T226637 [15:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:51] T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 [15:50:12] (03PS2) 10Andrew Bogott: nova-fullstack monitoring: allow for more leaks before we alert [puppet] - 10https://gerrit.wikimedia.org/r/520262 (https://phabricator.wikimedia.org/T227060) [15:51:02] (03CR) 10Volans: "Just one question inline, looks good otherwise." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519218 (https://phabricator.wikimedia.org/T226508) (owner: 10Jbond) [15:51:04] (03PS1) 10Bstorm: toolforge: add a join configuration to the init setup [puppet] - 10https://gerrit.wikimedia.org/r/520265 (https://phabricator.wikimedia.org/T215531) [15:53:32] (03PS1) 10Ema: cache: reimage cp2026 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520266 (https://phabricator.wikimedia.org/T226637) [15:54:39] (03PS1) 10Elukey: Move the zookeeper submodule into the repository - part 1 [puppet] - 10https://gerrit.wikimedia.org/r/520267 [15:55:40] (03PS2) 10Ema: cache: reimage cp2026 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520266 (https://phabricator.wikimedia.org/T226637) [15:55:41] !log depool cp2026 and reimage as upload_ats T226637 [15:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:46] T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 [16:00:05] godog and _joe_: I, the Bot under the Fountain, allow thee, The Deployer, to do Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190702T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. [16:00:11] 10Operations, 10ops-eqiad: rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10elukey) Thanks a lot for the task! So the hostnames should follow the new an- convention, I'd say an-zoo100[1-3]? Or possibly an-conf100[1-3]? (maybe more consistent with the actual conf[12]* n... 
[16:00:26] come on jenkins [16:01:05] (03CR) 10Ayounsi: [C: 03+2] GRE tunnel between eqiad and eqord [dns] - 10https://gerrit.wikimedia.org/r/517989 (https://phabricator.wikimedia.org/T226158) (owner: 10Ayounsi) [16:01:13] (03PS2) 10Ayounsi: GRE tunnel between eqiad and eqord [dns] - 10https://gerrit.wikimedia.org/r/517989 (https://phabricator.wikimedia.org/T226158) [16:01:17] (03CR) 10Ema: [C: 03+2] cache: reimage cp2026 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520266 (https://phabricator.wikimedia.org/T226637) (owner: 10Ema) [16:02:04] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/17194/ - looks good" [puppet] - 10https://gerrit.wikimedia.org/r/520267 (owner: 10Elukey) [16:04:48] ACKNOWLEDGEMENT - HP RAID on db2049 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:6 - OK: 1I:1:1, 1I:1:10, 1I:1:11, 1I:1:12, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:7, 1I:1:8, 1I:1:9 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T227107 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [16:04:52] 10Operations, 10ops-codfw: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T227107 (10ops-monitoring-bot) [16:05:05] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi: Port varnishlog consumers to log to syslog / logging infra - https://phabricator.wikimedia.org/T227108 (10fgiunchedi) [16:07:10] 08Warning Alert for device cr2-esams.wikimedia.org - Memory over 85% [16:08:52] 10Operations, 10ops-eqiad, 10Analytics, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Ottomata) @Cmjohnson bump! 
[16:08:54] (03PS2) 10Elukey: Move the zookeeper submodule into the repository - part 1 [puppet] - 10https://gerrit.wikimedia.org/r/520267 [16:09:14] 10Operations, 10ops-eqiad, 10Analytics, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Ottomata) a:03Cmjohnson Feel free to reassign [16:09:58] PROBLEM - puppet last run on cp1080 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [16:11:26] PROBLEM - puppet last run on cp1086 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [16:11:26] (03PS1) 10Ema: Revert "cache: reimage cp2026 as upload_ats" [puppet] - 10https://gerrit.wikimedia.org/r/520269 (https://phabricator.wikimedia.org/T226637) [16:11:41] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi: Port varnishlog consumers to log to syslog / logging infra - https://phabricator.wikimedia.org/T227108 (10fgiunchedi) [16:12:23] (03PS2) 10Ema: Revert "cache: reimage cp2026 as upload_ats" [puppet] - 10https://gerrit.wikimedia.org/r/520269 (https://phabricator.wikimedia.org/T226637) [16:13:44] oh yeah, it's the empty-list problem I guess [16:14:05] for eqiad<->codfw stuff, which is a little different than the edge->core stuff [16:14:24] hmmmm [16:14:44] (03CR) 10Filippo Giunchedi: [C: 03+1] grafana: update script to manage certain fields [puppet] - 10https://gerrit.wikimedia.org/r/520040 (owner: 10Cwhite) [16:14:46] (eqiad wants a list of codfw varnish-backends to forward traffic to if it needs to, but there are none left!) 
[16:15:10] indeed [16:15:22] man jenkins is painfully slow [16:17:15] No manual entry for jenkins [16:17:47] godog wins [16:17:53] :D [16:17:54] godog: you should have said that at least after 6 minutes though [16:18:32] touché ema [16:19:15] !log add term allow-anycast-dns in filter labs-in4 [16:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:25] elukey: hehe I typed it in, don't know the output by <3 [16:23:22] (03PS6) 10Jbond: icinga user agent: add custom user agent to icing checks [puppet] - 10https://gerrit.wikimedia.org/r/519218 (https://phabricator.wikimedia.org/T226508) [16:23:47] (03CR) 10Ema: [V: 03+2 C: 03+2] Revert "cache: reimage cp2026 as upload_ats" [puppet] - 10https://gerrit.wikimedia.org/r/520269 (https://phabricator.wikimedia.org/T226637) (owner: 10Ema) [16:24:02] (03CR) 10Jbond: icinga user agent: add custom user agent to icing checks (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/519218 (https://phabricator.wikimedia.org/T226508) (owner: 10Jbond) [16:24:25] (03PS6) 10Jbond: icinga user agent: add custom user agent to icing checks [puppet] - 10https://gerrit.wikimedia.org/r/519227 (https://phabricator.wikimedia.org/T226508) [16:25:29] (03CR) 10CDanis: [C: 03+1] grafana: update script to manage certain fields (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/520040 (owner: 10Cwhite) [16:25:46] (03CR) 10Elukey: [C: 03+2] Move the zookeeper submodule into the repository - part 1 [puppet] - 10https://gerrit.wikimedia.org/r/520267 (owner: 10Elukey) [16:25:54] (03PS3) 10Elukey: Move the zookeeper submodule into the repository - part 1 [puppet] - 10https://gerrit.wikimedia.org/r/520267 [16:26:20] RECOVERY - puppet last run on cp1080 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [16:27:02] (03PS2) 10Filippo Giunchedi: hieradata: add netbox swift user [puppet] - 10https://gerrit.wikimedia.org/r/515058 [16:27:50] RECOVERY - puppet last run on cp1086 is OK: OK: 
Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:28:08] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: add netbox swift user [puppet] - 10https://gerrit.wikimedia.org/r/515058 (owner: 10Filippo Giunchedi) [16:29:51] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] hieradata: add netbox swift user [puppet] - 10https://gerrit.wikimedia.org/r/515058 (owner: 10Filippo Giunchedi) [16:30:38] !log CI code-review +2 changes are not quite processed for some unknown reason T227111 [16:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:43] T227111: Zuul is no longer adding +2ed changes to the gate-and-submit pipeline - https://phabricator.wikimedia.org/T227111 [16:31:37] !log depool dns2002 from recdns server for testing [16:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:44] !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=dns2002.wikimedia.org [16:31:47] hashar: i just did a rebase and didn;t get a recheck is it related to ^^ https://gerrit.wikimedia.org/r/c/operations/puppet/+/519227 [16:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:57] jbond42: yes [16:32:07] ack ok ill subscribe thanks [16:32:46] !log testing failure scenarios on dns2002, possible false-alarm alerts (depooled from LVS recdns) [16:32:48] 10Operations, 10ops-codfw, 10DBA: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10RobH) p:05Triage→03Normal [16:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:01] 10Operations, 10ops-codfw, 10DBA: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10RobH) [16:34:36] gerrit is showing "Cannot start command "gerrit query --format json --commit-message --current-patch-set change:I13f476d8126f81b0417e7509784c83d4f21cf348" for user jenkins-bot" hashar [16:35:03] bah :-( [16:35:06] (03CR) 
10Elukey: [C: 03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/520267 (owner: 10Elukey) [16:35:14] (03PS4) 10Elukey: Move the zookeeper submodule into the repository - part 1 [puppet] - 10https://gerrit.wikimedia.org/r/520267 [16:35:20] paladox: can you comment on https://phabricator.wikimedia.org/T227111 ? [16:35:34] hashar i see a traceback to reviewdb [16:36:19] at least stream-events works [16:37:08] ohh [16:37:12] communication errors [16:37:29] gerrit lost mysql connection? [16:37:33] 10Operations, 10Wikimedia-Site-requests: Global rename of Fiona B. → Fiona*: supervision needed - https://phabricator.wikimedia.org/T224348 (10Itti) @jcrespo, thank you very much. I ask the user today for her ok, but she rejact the rename: https://de.wikipedia.org/w/index.php?title=Benutzerin_Diskussion:Fion... [16:37:43] i see: [16:37:44] Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure [16:37:44] The last packet successfully received from the server was 40 milliseconds ago. The last packet sent successfully to the server was 40 milliseconds ago. [16:37:44] at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) [16:38:13] im trying to paste the full traceback into the task [16:38:15] (03CR) 10Volans: "Reply inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519218 (https://phabricator.wikimedia.org/T226508) (owner: 10Jbond) [16:40:01] https://phabricator.wikimedia.org/T227111#5300853 is the full traceback [16:40:47] 10Operations, 10Wikimedia-Site-requests: Global rename of Fiona B. → Fiona*: supervision needed - https://phabricator.wikimedia.org/T224348 (10Urbanecm) 05Open→03Invalid Per comment above. 
[16:47:46] so I think zuul lost its connection with GErrit [16:47:55] and thus no more listen to events :-\ [16:49:09] (03PS1) 10Elukey: Move the zookeeper submodule into the repository - part 2 [puppet] - 10https://gerrit.wikimedia.org/r/520271 (https://phabricator.wikimedia.org/T226466) [16:52:15] !log Stopping Jenkins and Zuul T227111 [16:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:20] T227111: Zuul is no longer adding jobs to any jenkins pipelines - https://phabricator.wikimedia.org/T227111 [16:53:49] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=dns2002.wikimedia.org [16:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:04] (03CR) 10Elukey: "Will deploy when CI is back" [puppet] - 10https://gerrit.wikimedia.org/r/520267 (owner: 10Elukey) [16:54:46] something basically killed Zuul :/ [16:55:06] !log Starting Jenkins and Zuul T227111 [16:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:39] picking changes after re+2 again :) [16:58:44] 10Operations, 10Performance-Team, 10TechCom-RFC, 10Traffic, and 4 others: Serve Main Page of WMF wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Ladsgroup) Thanks. The naming confused me and sorry for the mess. IMO, there's three different questions that we should answer: - Sho... 
[16:59:09] \o/ [16:59:43] Lucas_WMDE: hauskatze paladox so yeah I have just restarted Zuul [16:59:56] !log CI is back, I had to restart Zuul :-\ T227111 [17:00:00] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/519227 (https://phabricator.wikimedia.org/T226508) (owner: 10Jbond) [17:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:02] T227111: Zuul is no longer adding jobs to any jenkins pipelines - https://phabricator.wikimedia.org/T227111 [17:00:04] cscott, arlolra, subbu, and halfak: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190702T1700). [17:00:13] ok :) [17:00:15] no parsoid deploys today [17:00:26] merci bien monsieur hashar [17:01:12] 10Operations, 10CX-cxserver, 10Citoid, 10Graphoid, and 10 others: Make services swagger specs standard compliant - https://phabricator.wikimedia.org/T218217 (10MSantos) [17:01:54] 10Operations, 10vm-requests, 10cloud-services-team (Kanban): Three small ganeti VMs to host haproxy for OpenStack endpoints - https://phabricator.wikimedia.org/T227041 (10Andrew) (Let's use Buster for this if it's available on ganeti) [17:03:04] (03CR) 10Jbond: [C: 03+2] icinga user agent: add custom user agent to icing checks [puppet] - 10https://gerrit.wikimedia.org/r/519227 (https://phabricator.wikimedia.org/T226508) (owner: 10Jbond) [17:03:16] (03PS7) 10Jbond: icinga user agent: add custom user agent to icing checks [puppet] - 10https://gerrit.wikimedia.org/r/519227 (https://phabricator.wikimedia.org/T226508) [17:07:14] 10Operations, 10Performance-Team, 10TechCom-RFC, 10Traffic, and 4 others: Serve Main Page of WMF wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10jcrespo) [17:14:27] ACKNOWLEDGEMENT - puppet last run on labmon1001 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 16 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[sqlite3-pcre] andrew bogott T227105 [17:14:27] ACKNOWLEDGEMENT - puppet last run on labmon1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 18 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[sqlite3-pcre] andrew bogott T227105 [17:19:03] (03PS1) 10Jcrespo: switchover.py: Add some extra automations to the script [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520280 [17:20:08] PROBLEM - Bird Internet Routing Daemon on dns2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast_recursive_DNS%23Bird_daemon_not_running [17:20:47] 10Operations, 10MobileFrontend, 10TechCom-RFC, 10Traffic, 10Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (10DIKW_Pyramid) >>! In T214998#5300227, @Krinkle wrote: >>>! In T214998#5298484, @DIKW_Pyra... [17:20:53] (03CR) 10jerkins-bot: [V: 04-1] switchover.py: Add some extra automations to the script [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520280 (owner: 10Jcrespo) [17:21:13] bblack: dns2001 alert is you? 
[17:23:08] PROBLEM - Bird Internet Routing Daemon on dns2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast_recursive_DNS%23Bird_daemon_not_running [17:23:11] (03PS2) 10Jcrespo: switchover.py: Add some extra automations to the script [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520280 [17:23:16] cdanis: yes [17:23:33] the bird stuff isn't in prod use yet, I'm testing failure scenarios stopping various bits [17:23:38] (03CR) 10jerkins-bot: [V: 04-1] switchover.py: Add some extra automations to the script [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520280 (owner: 10Jcrespo) [17:23:39] ahh okay [17:25:36] (03PS7) 10Jbond: icinga user agent: add custom user agent to icing checks [puppet] - 10https://gerrit.wikimedia.org/r/519218 (https://phabricator.wikimedia.org/T226508) [17:26:04] RECOVERY - Bird Internet Routing Daemon on dns2001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast_recursive_DNS%23Bird_daemon_not_running [17:26:06] RECOVERY - Bird Internet Routing Daemon on dns2002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast_recursive_DNS%23Bird_daemon_not_running [17:27:04] (03PS3) 10Isaac Johnson: Undeploy reader demographics surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519216 (https://phabricator.wikimedia.org/T226273) (owner: 10Bmansurov) [17:28:39] (03CR) 10Jbond: icinga user agent: add custom user agent to icing checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519218 (https://phabricator.wikimedia.org/T226508) (owner: 10Jbond) [17:29:56] (03CR) 10Ayounsi: [C: 03+2] "recheck" [dns] - 10https://gerrit.wikimedia.org/r/517989 (https://phabricator.wikimedia.org/T226158) (owner: 10Ayounsi) [17:35:18] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 35.71% of data above the critical threshold [140.0] 
https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:38:12] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:38:40] 10Operations, 10MobileFrontend, 10TechCom-RFC, 10Traffic, 10Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (10Krinkle) >>>! In T214998#5300227, @Krinkle wrote: >>>>! In T214998#5298484, @DIKW_Pyramid... [17:39:08] (03PS1) 10Urbanecm: Rename `Image-reviewer` to `image-reviewer` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520283 [17:40:17] (03PS2) 10Urbanecm: Rename `Image-reviewer` to `image-reviewer` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520283 (https://phabricator.wikimedia.org/T216406) [17:42:08] (03PS1) 10BBlack: anycast recdns: test via resolv.conf on 5 hosts [puppet] - 10https://gerrit.wikimedia.org/r/520284 (https://phabricator.wikimedia.org/T186550) [17:43:34] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@9ca9b0f]: Update mobileapps to 941e14f (T219998 T217352 T219909) [17:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:42] T219998: mobile-html interaction handling - https://phabricator.wikimedia.org/T219998 [17:43:43] T219909: mobile-html section offsets - https://phabricator.wikimedia.org/T219909 [17:43:43] T217352: mobile-html page footer handling - https://phabricator.wikimedia.org/T217352 [17:46:15] (03CR) 10BBlack: "Compiler output looks legit!" 
[puppet] - 10https://gerrit.wikimedia.org/r/520284 (https://phabricator.wikimedia.org/T186550) (owner: 10BBlack) [17:49:24] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@9ca9b0f]: Update mobileapps to 941e14f (T219998 T217352 T219909) (duration: 05m 49s) [17:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:31] T219998: mobile-html interaction handling - https://phabricator.wikimedia.org/T219998 [17:49:31] T219909: mobile-html section offsets - https://phabricator.wikimedia.org/T219909 [17:49:32] T217352: mobile-html page footer handling - https://phabricator.wikimedia.org/T217352 [17:52:04] (03PS8) 10Jbond: icinga user agent: add custom user agent to icing checks [puppet] - 10https://gerrit.wikimedia.org/r/519218 (https://phabricator.wikimedia.org/T226508) [17:52:56] jbond42: you really don't like that previous version :-P I'll check it in a bit [17:53:25] volans: im not sure i like my version either tbh :D [17:54:05] LOL [17:57:46] 10Operations, 10Analytics, 10Security, 10Services (watching), 10Wikimedia-Incident: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Nuria) Per our conversation at standup we should probably have limits per host, does the... [18:00:14] 10Operations, 10MediaWiki-extensions-CentralAuth, 10Traffic, 10Performance-Team (Radar), and 2 others: Consistent HTTP 503 Varnish Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (10BBlack) @Anomie / @Legoktm - Can you take a look at th... [18:00:50] (03CR) 10Volans: icinga user agent: add custom user agent to icing checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519218 (https://phabricator.wikimedia.org/T226508) (owner: 10Jbond) [18:01:33] 10Operations, 10ops-eqiad: rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10Ottomata) I like an-conf. 
Also gives us the option to colocate something else on them if we need to one day. Thanks! [18:02:29] (03PS1) 10CDanis: Revert "grafana: also install sqlite3-pcre" [puppet] - 10https://gerrit.wikimedia.org/r/520288 (https://phabricator.wikimedia.org/T227105) [18:02:33] (03CR) 10Jcrespo: [C: 03+1] mariadb: Promote db1120 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/519185 (https://phabricator.wikimedia.org/T226358) (owner: 10Marostegui) [18:02:44] (03PS2) 10CDanis: Revert "grafana: also install sqlite3-pcre" [puppet] - 10https://gerrit.wikimedia.org/r/520288 (https://phabricator.wikimedia.org/T227105) [18:02:50] (03CR) 10Jcrespo: [C: 03+1] wmnet: Change x1-master to point to the new master [dns] - 10https://gerrit.wikimedia.org/r/519186 (https://phabricator.wikimedia.org/T226358) (owner: 10Marostegui) [18:03:10] 10Operations, 10Performance-Team, 10TechCom-RFC, 10Traffic, and 4 others: Serve Main Page of WMF wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Krinkle) >>! In T120085#5300982, @Ladsgroup wrote: > [..] there's three different questions that we should answer: > - Should we defin... 
[18:03:32] (03CR) 10Jcrespo: [C: 03+1] db-eqiad.php: Promote db1120 to x1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519187 (https://phabricator.wikimedia.org/T226358) (owner: 10Marostegui) [18:04:09] (03CR) 10CDanis: [C: 03+2] Revert "grafana: also install sqlite3-pcre" [puppet] - 10https://gerrit.wikimedia.org/r/520288 (https://phabricator.wikimedia.org/T227105) (owner: 10CDanis) [18:07:32] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:07:32] !log setup tunnel between eqord and eqiad - T226158 [18:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:46] RECOVERY - puppet last run on labmon1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:12:08] 10Operations, 10Performance-Team, 10TechCom-RFC, 10Traffic, and 4 others: Serve Main Page of WMF wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Ladsgroup) >>! In T120085#5301256, @Krinkle wrote: > This is not required for the current RFC. MediaWiki supports the required function... [18:12:11] 08̶W̶a̶r̶n̶i̶n̶g Device cr2-esams.wikimedia.org recovered from Memory over 85% [18:14:30] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Looking at labs/private:" [puppet] - 10https://gerrit.wikimedia.org/r/511686 (owner: 10Jbond) [18:16:10] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Ottomata) [18:16:58] (03CR) 10MarcoAurelio: "Hmm. I'm seeing not only commons is using an Image-reviewer group; fawiki also uses it. 
Maybe we can do commons only for now and pick othe" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520283 (https://phabricator.wikimedia.org/T216406) (owner: 10Urbanecm) [18:23:31] 10Operations, 10Analytics, 10Analytics-Kanban: Terminate Wikimetrics - https://phabricator.wikimedia.org/T219446 (10Nuria) a:03mforns [18:23:49] 10Operations, 10Analytics, 10Analytics-Kanban: Terminate Wikimetrics - https://phabricator.wikimedia.org/T219446 (10Nuria) Moving to kanban to take care of this in Q1 2019 [18:24:38] (03PS3) 10Urbanecm: Rename `Image-reviewer` to `image-reviewer` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520283 (https://phabricator.wikimedia.org/T216406) [18:25:18] (03PS9) 10Jbond: icinga user agent: add custom user agent to icing checks [puppet] - 10https://gerrit.wikimedia.org/r/519218 (https://phabricator.wikimedia.org/T226508) [18:25:20] (03PS1) 10Jbond: python3/icinga check: refactor check to python3 [puppet] - 10https://gerrit.wikimedia.org/r/520290 [18:27:20] (03CR) 10Jbond: icinga user agent: add custom user agent to icing checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519218 (https://phabricator.wikimedia.org/T226508) (owner: 10Jbond) [18:28:24] (03CR) 10Cwhite: [C: 03+2] grafana: update script to manage certain fields (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/520040 (owner: 10Cwhite) [18:28:31] (03PS1) 10Ladsgroup: labs: Temporary disable read new for wikibase term store of commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520292 (https://phabricator.wikimedia.org/T225053) [18:28:46] (03PS3) 10Cwhite: grafana: update script to manage certain fields [puppet] - 10https://gerrit.wikimedia.org/r/520040 [18:31:00] (03CR) 10Ladsgroup: [C: 03+2] "noop for prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520292 (https://phabricator.wikimedia.org/T225053) (owner: 10Ladsgroup) [18:31:53] (03Merged) 10jenkins-bot: labs: Temporary disable read new for wikibase term store of commons 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/520292 (https://phabricator.wikimedia.org/T225053) (owner: 10Ladsgroup) [18:32:09] (03CR) 10jenkins-bot: labs: Temporary disable read new for wikibase term store of commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520292 (https://phabricator.wikimedia.org/T225053) (owner: 10Ladsgroup) [18:34:13] (03PS1) 10Paladox: Gerrit: Convert CoC and Privacy links to use the new PolyGerrit extension point [puppet] - 10https://gerrit.wikimedia.org/r/520295 [18:36:46] (03PS1) 10CRusnov: netbox: Add parameters and settings for storing things in Swift [puppet] - 10https://gerrit.wikimedia.org/r/520296 (https://phabricator.wikimedia.org/T209182) [18:40:15] (03PS2) 10Paladox: Gerrit: Convert CoC and Privacy links to use the new PolyGerrit extension point [puppet] - 10https://gerrit.wikimedia.org/r/520295 [18:40:29] (03PS1) 10CRusnov: netbox: add swift keys [labs/private] - 10https://gerrit.wikimedia.org/r/520297 [18:41:38] (03PS3) 10Paladox: Gerrit: Convert CoC and Privacy links to use the new PolyGerrit extension point [puppet] - 10https://gerrit.wikimedia.org/r/520295 [18:47:44] (03CR) 10Dzahn: [C: 03+1] netbox: add swift keys [labs/private] - 10https://gerrit.wikimedia.org/r/520297 (owner: 10CRusnov) [18:48:17] (03CR) 10CRusnov: [V: 03+2 C: 03+2] netbox: add swift keys [labs/private] - 10https://gerrit.wikimedia.org/r/520297 (owner: 10CRusnov) [18:51:05] * Krinkle is staging for wmf.11 on mwdebug1002 [18:54:08] (03PS4) 10Paladox: Gerrit: Convert CoC and Privacy links to use the new PolyGerrit extension point [puppet] - 10https://gerrit.wikimedia.org/r/520295 [18:55:14] (03PS5) 10Paladox: Gerrit: Convert CoC and Privacy links to use the new PolyGerrit extension point [puppet] - 10https://gerrit.wikimedia.org/r/520295 [18:56:06] (03PS4) 10Dzahn: phabricator: Manage ownership of /var/log/phd/ssh.log [puppet] - 10https://gerrit.wikimedia.org/r/519532 (https://phabricator.wikimedia.org/T224677) (owner: 1020after4) 
[18:57:59] !log mholloway-shell@deploy1001 Started deploy [recommendation-api/deploy@a29da76]: Update recommendation-api to 4f50c71 [18:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:32] (03PS6) 10Paladox: Gerrit: Convert CoC and Privacy links to use the new PolyGerrit extension point [puppet] - 10https://gerrit.wikimedia.org/r/520295 [19:00:02] (03CR) 10Paladox: "Tested locally and works, requires gerrit 2.15.15 so this change is on hold." [puppet] - 10https://gerrit.wikimedia.org/r/520295 (owner: 10Paladox) [19:00:49] !log mholloway-shell@deploy1001 Finished deploy [recommendation-api/deploy@a29da76]: Update recommendation-api to 4f50c71 (duration: 02m 50s) [19:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:56] (03CR) 10Dzahn: [C: 03+1] "/var/log/phd the directory does not exist yet. it needs to be defined separately before the file inside it." [puppet] - 10https://gerrit.wikimedia.org/r/519532 (https://phabricator.wikimedia.org/T224677) (owner: 1020after4) [19:02:27] 10Operations, 10ops-eqiad, 10DC-Ops: a8-eqiad pdu refresh - https://phabricator.wikimedia.org/T227133 (10RobH) [19:02:47] 10Operations, 10ops-eqiad, 10DC-Ops: install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) [19:03:14] !log krinkle@deploy1001 scap sync-l10n completed (1.34.0-wmf.11) (duration: 00m 48s) [19:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:30] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.11/extensions/AbuseFilter/: 9963d843622b / T227095 (duration: 00m 51s) [19:05:41] 10Operations, 10ops-eqiad, 10DC-Ops: a8-eqiad pdu refresh - https://phabricator.wikimedia.org/T227133 (10RobH) [19:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:05] T227095: Fatal InvalidArgumentException on various Special:AbuseLog urls - https://phabricator.wikimedia.org/T227095 [19:07:25] 
!log krinkle@deploy1001 scap sync-l10n completed (1.34.0-wmf.11) (duration: 00m 47s) [19:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:37] (03PS1) 10Bstorm: toolforge: reload haproxy on subconfig changes [puppet] - 10https://gerrit.wikimedia.org/r/520303 (https://phabricator.wikimedia.org/T215531) [19:08:47] (03PS7) 10Paladox: Gerrit: Convert CoC and Privacy links to use the new PolyGerrit extension point [puppet] - 10https://gerrit.wikimedia.org/r/520295 [19:09:38] 10Operations, 10ops-eqiad: (Need By: June 30) rack/setup/install kafka-main100[1-5] - https://phabricator.wikimedia.org/T226274 (10wiki_willy) [19:12:23] (03PS8) 10Paladox: Gerrit: Convert CoC and Privacy links to use the new PolyGerrit extension point [puppet] - 10https://gerrit.wikimedia.org/r/520295 [19:18:50] !log krinkle@deploy1001 Started scap: l10n sync did not work as expected, try full scap to fix missing i18n message for 9963d843622 [19:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:13] (03PS10) 10Jbond: icinga user agent: add custom user agent to icing checks [puppet] - 10https://gerrit.wikimedia.org/r/519218 (https://phabricator.wikimedia.org/T226508) [19:28:53] 10Operations, 10MediaWiki-extensions-CentralAuth, 10Traffic, 10Performance-Team (Radar), and 2 others: Consistent HTTP 503 Varnish Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (10Tgr) The cookie is output in `CentralAuthSessionProvid... [19:33:41] (03CR) 10Dzahn: [C: 04-1] "could you amend to add directory /var/log/phd first and then the file inside it with a dependency between them. and let's use 640 ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/519532 (https://phabricator.wikimedia.org/T224677) (owner: 1020after4) [19:35:14] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 50.00, 20.97, 13.85 [19:36:08] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: restore shell access for dzahn - https://phabricator.wikimedia.org/T227052 (10Dzahn) Thanks! Confirmed working. I got on the bastions and other servers. [19:36:44] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 18.91, 18.45, 13.64 [19:37:15] !log krinkle@deploy1001 Finished scap: l10n sync did not work as expected, try full scap to fix missing i18n message for 9963d843622 (duration: 18m 24s) [19:37:26] (03PS1) 10Bstorm: toolforge: a firewall default of DENY is maddening for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/520306 (https://phabricator.wikimedia.org/T215531) [19:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:30] (03PS10) 10Cwhite: initial attempt at a varnishkafka exporter [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) [19:50:52] 10Operations, 10vm-requests, 10cloud-services-team (Kanban): Three small ganeti VMs to host haproxy for OpenStack endpoints - https://phabricator.wikimedia.org/T227041 (10akosiaris) >>! In T227041#5300492, @Andrew wrote: >>>! In T227041#5299907, @akosiaris wrote: >> Sounds fine to me. Please use `row_A` in e... [19:51:39] 10Operations, 10vm-requests, 10cloud-services-team (Kanban): Three small ganeti VMs to host haproxy for OpenStack endpoints - https://phabricator.wikimedia.org/T227041 (10akosiaris) >>! In T227041#5301002, @Andrew wrote: > (Let's use Buster for this if it's available on ganeti) It is, same as the rest of th... 
[19:53:07] (03PS2) 10Bstorm: toolforge: add a join configuration to the init setup [puppet] - 10https://gerrit.wikimedia.org/r/520265 (https://phabricator.wikimedia.org/T215531) [19:54:22] (03CR) 10Bstorm: [C: 03+2] toolforge: add a join configuration to the init setup [puppet] - 10https://gerrit.wikimedia.org/r/520265 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [19:55:05] (03CR) 10Cwhite: "Latest change downgrades prometheus-client to the version in stretch. There are subtle differences that broke tests so I've upgraded the " [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [19:56:39] (03CR) 10Bstorm: [C: 03+2] "Merging to move this forward, but see note!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/520303 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [19:56:56] (03PS2) 10Bstorm: toolforge: reload haproxy on subconfig changes [puppet] - 10https://gerrit.wikimedia.org/r/520303 (https://phabricator.wikimedia.org/T215531) [19:58:09] 10Operations, 10ops-eqiad, 10DC-Ops: a2-eqiad pdu refresh - https://phabricator.wikimedia.org/T227138 (10RobH) [19:59:25] 10Operations, 10ops-eqiad, 10DC-Ops: a2-eqiad pdu refresh - https://phabricator.wikimedia.org/T227138 (10RobH) [19:59:42] 10Operations, 10ops-eqiad, 10DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (10RobH) [20:00:13] 10Operations, 10ops-eqiad, 10DC-Ops: a4-eqiad pdu refresh - https://phabricator.wikimedia.org/T227140 (10RobH) [20:00:14] (03PS2) 10Bstorm: toolforge: a firewall default of DENY is maddening for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/520306 (https://phabricator.wikimedia.org/T215531) [20:00:38] 10Operations, 10ops-eqiad, 10DC-Ops: a5-eqiad pdu refresh - https://phabricator.wikimedia.org/T227141 (10RobH) [20:00:58] 10Operations, 10ops-eqiad, 10DC-Ops: a6-eqiad pdu refresh - https://phabricator.wikimedia.org/T227142 (10RobH) 
[20:01:22] 10Operations, 10ops-eqiad, 10DC-Ops: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 (10RobH) [20:02:08] 10Operations, 10ops-eqiad, 10DC-Ops: install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) [20:03:46] (03CR) 10Bstorm: [C: 03+2] toolforge: a firewall default of DENY is maddening for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/520306 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [20:04:20] (03PS5) 1020after4: phabricator: Manage ownership of /var/log/phd/ssh.log [puppet] - 10https://gerrit.wikimedia.org/r/519532 (https://phabricator.wikimedia.org/T224677) [20:06:07] 10Operations, 10ops-eqiad, 10DC-Ops: a2-eqiad pdu refresh - https://phabricator.wikimedia.org/T227138 (10RobH) [20:06:16] 10Operations, 10ops-eqiad, 10DC-Ops: a2-eqiad pdu refresh - https://phabricator.wikimedia.org/T227138 (10RobH) p:05Triage→03Normal [20:06:38] (03PS1) 10Urbanecm: Remove expired throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520311 [20:06:43] 10Operations, 10ops-eqiad, 10DC-Ops: a2-eqiad pdu refresh - https://phabricator.wikimedia.org/T227138 (10RobH) [20:07:15] 10Operations, 10ops-eqiad, 10DC-Ops: a2-eqiad pdu refresh - https://phabricator.wikimedia.org/T227138 (10RobH) [20:08:19] 10Operations, 10ops-eqiad, 10DC-Ops: install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) [20:13:41] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@cc60181]: Weekly WDQS deploy [20:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:23] (03PS2) 10CRusnov: netbox: Add parameters and settings for storing things in Swift [puppet] - 10https://gerrit.wikimedia.org/r/520296 (https://phabricator.wikimedia.org/T209182) [20:17:57] (03CR) 1020after4: "Actually, the directory is managed:" [puppet] - 10https://gerrit.wikimedia.org/r/519532 (https://phabricator.wikimedia.org/T224677) (owner: 
1020after4) [20:20:04] !log contint1001 - temp installing parted for labeling new disks sdc and sdd for raid for docker images (T207707) [20:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:10] T207707: contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 [20:20:13] (03CR) 10CRusnov: "Compiler output looks good:" [puppet] - 10https://gerrit.wikimedia.org/r/520296 (https://phabricator.wikimedia.org/T209182) (owner: 10CRusnov) [20:21:30] (03PS6) 1020after4: phabricator: Manage ownership of /var/log/phd/ssh.log [puppet] - 10https://gerrit.wikimedia.org/r/519532 (https://phabricator.wikimedia.org/T224677) [20:27:12] (03CR) 10CDanis: [C: 03+1] "just a few nitpicks, thank you again for doing this" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/509365 (https://phabricator.wikimedia.org/T183177) (owner: 10Jbond) [20:28:24] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@cc60181]: Weekly WDQS deploy (duration: 14m 43s) [20:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:02] !log contint1001 - created new partitions on /dev/sdc and /dev/sdd; created new RAID 1 over /dev/sdc1 and /dev/sdd1 [20:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:01] 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops, 10Release-Engineering-Team-TODO (201907): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10Dzahn) > RAID1 over the new disks /dev/sdc and /dev/sdd apt-get install parted parted... 
[20:37:31] 10Operations, 10ops-eqiad, 10DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (10RobH) [20:39:19] 10Operations, 10ops-eqiad, 10DC-Ops: (July 22-26) install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) [20:40:17] 10Operations, 10ops-eqiad, 10DC-Ops: (July 22-26) install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) [21:10:23] 10Operations, 10ops-codfw: pull decom hardware and ship to Harry/OIT @ SF office - https://phabricator.wikimedia.org/T222383 (10HMarcus) 05Open→03Resolved [21:11:23] 10Operations, 10ops-eqiad, 10DC-Ops: a4-eqiad pdu refresh - https://phabricator.wikimedia.org/T227140 (10RobH) [21:15:54] hi all, we (OIT) will need a wildcard cert renewed by the tech-ops team. when creating the phabricator ticket, who should be tagged with this? thanks [21:19:10] hmarcus: #procurement [21:20:38] hmarcus: https://phabricator.wikimedia.org/T197840 is last year's, https://phabricator.wikimedia.org/T167346 is 2017's [21:20:42] if it helps :) [21:20:52] (03PS1) 10Bstorm: toolforge: refactor to join a node to the new cluster [puppet] - 10https://gerrit.wikimedia.org/r/520319 (https://phabricator.wikimedia.org/T215531) [21:21:19] that's perfect, thanks so much! [21:22:02] 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops, 10Release-Engineering-Team-TODO (201907): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10Dzahn) >>! In T207707#5271473, @hashar wrote: > * a LVM volume group pvcreate /dev/md3... [21:22:56] 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops, 10Release-Engineering-Team-TODO (201907): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10Dzahn) ` root@contint1001:/mnt/docker# df -h Filesystem Size... 
[21:23:44] Operations, ops-eqiad, Analytics, hardware-requests, and 2 others: Upgrade kafka-jumbo100[1-6] to 10G NICs (if possible) - https://phabricator.wikimedia.org/T220700 (wiki_willy) @elukey - just wanted to follow up on this...@RobH will dig around for some quotes and recommendations
[21:23:48] Operations, ops-eqiad, Analytics, hardware-requests, and 2 others: Upgrade kafka-jumbo100[1-6] to 10G NICs (if possible) - https://phabricator.wikimedia.org/T220700 (RobH) So these are all in warranty until 2020-05-31, so we will want to add in 10G NICs that are covered by Dell's system warranty....
[21:24:58] (PS1) Catrope: cawiki beta: Configure wgEchoPollForUpdates as a number [mediawiki-config] - https://gerrit.wikimedia.org/r/520320
[21:25:13] (CR) Catrope: [C: +2] cawiki beta: Configure wgEchoPollForUpdates as a number [mediawiki-config] - https://gerrit.wikimedia.org/r/520320 (owner: Catrope)
[21:26:03] Operations, ops-eqiad: rack/setup/install puppetmaster1003.eqiad.wmnet - https://phabricator.wikimedia.org/T201342 (wiki_willy) Checked @RobH , who will continue with the install.
Thanks, Willy
[21:26:12] (Merged) jenkins-bot: cawiki beta: Configure wgEchoPollForUpdates as a number [mediawiki-config] - https://gerrit.wikimedia.org/r/520320 (owner: Catrope)
[21:26:31] (CR) jenkins-bot: cawiki beta: Configure wgEchoPollForUpdates as a number [mediawiki-config] - https://gerrit.wikimedia.org/r/520320 (owner: Catrope)
[21:29:49] Operations, ops-eqiad, Analytics, hardware-requests, and 2 others: Upgrade kafka-jumbo100[1-6] to 10G NICs (if possible) - https://phabricator.wikimedia.org/T220700 (RobH)
[21:34:43] Operations, Continuous-Integration-Infrastructure, serviceops, Release-Engineering-Team-TODO (201907): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (Dzahn) a:Dzahn→hashar
[21:36:16] Operations, Continuous-Integration-Infrastructure, serviceops, Release-Engineering-Team-TODO (201907): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (Dzahn) Hi Hashar, at this point i think it makes sense to assign back to to you to chec...
[21:41:34] Operations, ops-eqiad: (OoW) Degraded RAID on analytics1039 - https://phabricator.wikimedia.org/T226599 (wiki_willy)
[21:42:20] Operations, MediaWiki-extensions-CentralAuth, Traffic, Performance-Team (Radar), and 2 others: Consistent HTTP 503 Varnish Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (Tgr) Annoyingly, adding an `X-Wikimedia-Debug` header...
[21:44:02] Operations, ops-eqiad, Thumbor, serviceops, User-jijiki: (OoW) thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (wiki_willy)
[21:51:39] Operations, ops-eqiad, Performance-Team (Radar): tungsten disk 1 and 8 SMART failure - https://phabricator.wikimedia.org/T193628 (wiki_willy) @Imarlier @fgiunchedi - just wanted to circle back, and see if tungsten can be decommissioned now (as proposed from last year) . Much appreciated in advance....
[21:51:53] Operations, ops-eqiad, Performance-Team (Radar): (OoW) tungsten disk 1 and 8 SMART failure - https://phabricator.wikimedia.org/T193628 (wiki_willy)
[21:53:30] Operations, ops-eqiad, cloud-services-team: (OoW) cloudvirt1006 - RAID battery failed - https://phabricator.wikimedia.org/T222950 (wiki_willy)
[21:57:18] Operations, procurement: Wildcard SSL Cert Renewal (*.corp.wikimedia.org) - https://phabricator.wikimedia.org/T227149 (Aklapper)
[22:02:51] Operations, ops-eqiad, ops-ulsfo, DC-Ops: connect atlas-ulsfo to scs-ulsfo - https://phabricator.wikimedia.org/T206185 (wiki_willy) a:Cmjohnson→RobH @RobH - if you still need this, thinking maybe you can just pull the adaptor when you're out at EQIAD later this month, and bring it back wi...
[22:04:11] Operations, Continuous-Integration-Infrastructure, serviceops, Release-Engineering-Team-TODO (201907): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (thcipriani) >>! In T207707#5302044, @Dzahn wrote: > Hi Hashar, at this point i think it...
[22:05:55] Operations, ops-eqiad, User-Elukey: (OoW) Heating alerts and broken RAM on kafka1014 - https://phabricator.wikimedia.org/T204479 (wiki_willy)
[22:07:51] Operations, ops-eqiad, User-Elukey: (OoW) Heating alerts and broken RAM on kafka1014 - https://phabricator.wikimedia.org/T204479 (wiki_willy) @Dzahn - just wanted to circle back around on this, and see if kafka1014 can be decommissioned. Thanks, Willy
[22:11:22] (CR) CRusnov: "Thanks for the review, replies / questions inline." (5 comments) [software/cumin] - https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) (owner: CRusnov)
[22:11:28] Operations, vm-requests, cloud-services-team (Kanban): Three small ganeti VMs to host haproxy for OpenStack endpoints - https://phabricator.wikimedia.org/T227041 (Andrew) > How is corosync/pacemaker going to work then with a single VIP? I may be missing something but we have range of service IPs th...
[22:13:27] Operations, ops-eqiad, Analytics, Analytics-Kanban, netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (wiki_willy) @Ottomata - can you reach out to Chris on IRC and schedule a time with him on this one? Sounds...
[22:30:13] (PS7) 20after4: phabricator: Manage ownership of /var/log/phd/ssh.log [puppet] - https://gerrit.wikimedia.org/r/519532 (https://phabricator.wikimedia.org/T224677)
[22:36:23] (CR) Dzahn: [C: +2] phabricator: Manage ownership of /var/log/phd/ssh.log [puppet] - https://gerrit.wikimedia.org/r/519532 (https://phabricator.wikimedia.org/T224677) (owner: 20after4)
[22:36:55] Operations, ops-eqiad, netops: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 (wiki_willy)
[22:37:30] Operations, ops-eqiad, netops: (Need By: Sept 30) upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (wiki_willy)
[22:39:28] (CR) 20after4: "thanks!" [puppet] - https://gerrit.wikimedia.org/r/519532 (https://phabricator.wikimedia.org/T224677) (owner: 20after4)
[22:42:31] (PS1) Ayounsi: Add rpkicounter [puppet] - https://gerrit.wikimedia.org/r/520337
[22:52:14] Operations, ops-eqiad: (OoW) Broken memory on mw1239 - https://phabricator.wikimedia.org/T209139 (wiki_willy)
[22:53:46] (CR) Dzahn: [C: +2] "np! right after merging i got kicked off the Internet.. but now it's actually applied on phab1003 and /var/log/phd/ssh.log is there and ow" [puppet] - https://gerrit.wikimedia.org/r/519532 (https://phabricator.wikimedia.org/T224677) (owner: 20after4)
[22:58:44] (CR) Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1001/17201/netflow1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/520337 (owner: Ayounsi)
[23:00:04] MaxSem, RoanKattouw, and Niharika: Time to snap out of that daydream and deploy Evening SWAT (Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190702T2300).
[23:00:05] RoanKattouw: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process.
Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:05:04] I'm here, trying to get my laptop and the wifi to cooperate
[23:09:05] Alright let's roll
[23:09:15] Looks like I'm the only one that has patches listed, so I'll do the SWAT myself
[23:26:46] Operations, ops-eqiad, Cassandra, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (wiki_willy) @Eevans - can you reach out to Chris on IRC to schedule specific days for this?...
[23:34:48] !log catrope@deploy1001 Synchronized php-1.34.0-wmf.11/skins/MonoBook/: T226594 (duration: 00m 50s)
[23:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:34:53] T226594: Wiki pages are very wide in Monobook for logged in users - https://phabricator.wikimedia.org/T226594
[23:36:09] !log catrope@deploy1001 Synchronized php-1.34.0-wmf.11/extensions/Echo/: T226594 (duration: 00m 51s)
[23:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:38:17] Achievement unlocked: deploy from Amtrak wifi
[23:41:10] heh
[23:41:19] RoanKattouw: Many more badges like that to collect ;)
[23:41:43] I've definitely not deployed from a ferry before :)
[23:42:42] The sausalito one seems an easy way to fix that :D
[23:43:10] Reedy: Psh, the ferry isn't international, it doesn't count.
[23:43:16] Ahah
[23:43:21] * James_F is still waiting for Reedy's deploy from LEO.
[23:43:56] I'll message Elon
[23:44:20] Branson looks to be getting there faster.