[01:42:54] Operations, Analytics, vm-requests: eqiad: (3) VM %request for internal analytics web sites - https://phabricator.wikimedia.org/T202013 (Ottomata) Since these are all meant to host 'websites', how about analytics-web100[123]?
[01:42:56] Operations, Analytics, vm-requests: eqiad: (3) VM %request for internal analytics web sites - https://phabricator.wikimedia.org/T202013 (Ottomata) I like the tool idea too, but prefer the singular: analytics-tool100[123]?
[01:46:58] PROBLEM - HHVM rendering on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:47:58] RECOVERY - HHVM rendering on mw1289 is OK: HTTP OK: HTTP/1.1 200 OK - 82324 bytes in 0.154 second response time
[02:36:05] Operations, Wikimedia-Mailing-lists: Password Reset Link - https://phabricator.wikimedia.org/T202247 (Geekdidi)
[02:36:55] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.16) (duration: 14m 23s)
[02:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:41:31] Operations, Wikimedia-Mailing-lists: Password reset request for wikimedia-nd mailing list - https://phabricator.wikimedia.org/T202247 (Legoktm)
[02:47:17] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Mon Aug 20 02:47:17 UTC 2018 (duration 10m 22s)
[02:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:31:29] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 875.41 seconds
[03:48:29] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 211.53 seconds
[04:18:58] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 51.41, 25.67, 15.40
[04:19:38] PROBLEM - High CPU load on API appserver on mw1228 is CRITICAL: CRITICAL - load average: 48.62, 25.55, 15.05
[04:20:08] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 48.41, 27.83,
15.70
[04:22:39] PROBLEM - High CPU load on API appserver on mw1228 is CRITICAL: CRITICAL - load average: 45.98, 35.33, 20.80
[04:26:39] RECOVERY - High CPU load on API appserver on mw1228 is OK: OK - load average: 10.78, 24.81, 20.07
[04:28:19] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 8.02, 23.49, 20.77
[04:30:18] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 8.86, 22.79, 23.18
[05:12:14] Operations, DBA: rack/setup/install dbproxy101[2-7].eqiad.wmnet - https://phabricator.wikimedia.org/T196690 (Marostegui) Thank you guys! We'll take it from here
[06:44:48] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 49.43, 24.11, 15.11
[06:53:28] RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 9.78, 22.53, 20.23
[06:57:29] RECOVERY - Check systemd state on ms-be2024 is OK: OK - running: The system is fully operational
[06:57:48] !log reset-failed debmonitor failed session on ms-be2024 ^^^^
[06:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:21] (CR) Volans: "recheck" [software/spicerack] - https://gerrit.wikimedia.org/r/451254 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[07:08:56] (CR) Volans: "recheck" [software/spicerack] - https://gerrit.wikimedia.org/r/451538 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[07:09:02] (CR) Volans: "recheck" [software/spicerack] - https://gerrit.wikimedia.org/r/453371 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[07:09:06] (CR) Volans: "recheck" [software/spicerack] - https://gerrit.wikimedia.org/r/453372 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[07:09:12] (CR) Volans: "recheck" [software/spicerack] - https://gerrit.wikimedia.org/r/453373 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[07:28:18] Operations, Traffic, Patch-For-Review:
Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp4032.ulsfo.wmnet', 'cp2005.codfw.wmnet'] ``` The log can be found in `/var/l...
[07:29:22] !log resetting management card on elastic1022
[07:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:24] (CR) Muehlenhoff: [C: 1] "This looks really nice!" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/452664 (https://phabricator.wikimedia.org/T201140) (owner: Giuseppe Lavagetto)
[07:34:17] (PS1) Muehlenhoff: Extended MOU date for flemmerich [puppet] - https://gerrit.wikimedia.org/r/453937
[07:35:29] RECOVERY - IPMI Sensor Status on elastic1022 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[07:36:26] Operations, ops-eqiad, Discovery, Discovery-Search, Elasticsearch: check elastic1022 power supply redundancy - https://phabricator.wikimedia.org/T177631 (Gehel) Open>Resolved It looks like a reset of the management interface fixed the reporting issue to ipmi-sensors: gehel@elasti...
[07:38:43] (CR) Muehlenhoff: [C: 2] Extended MOU date for flemmerich [puppet] - https://gerrit.wikimedia.org/r/453937 (owner: Muehlenhoff)
[07:50:46] (CR) Gehel: [C: 1] "LGTM" (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/451254 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[07:51:24] (CR) Volans: [C: 2] Add confctl module to interact with conftool [software/spicerack] - https://gerrit.wikimedia.org/r/451254 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[07:53:09] (Merged) jenkins-bot: Add confctl module to interact with conftool [software/spicerack] - https://gerrit.wikimedia.org/r/451254 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[07:55:06] Operations, Wikimedia-Mailing-lists: Password reset request for wikimedia-nd mailing list - https://phabricator.wikimedia.org/T202247 (Aklapper) Have you asked wpportharcourt (other admin) for the password already?
[07:55:30] (CR) Jcrespo: [C: 2] toolsdb: Ignore s51290__dpl_p replication on toolsdb replica [puppet] - https://gerrit.wikimedia.org/r/453355 (https://phabricator.wikimedia.org/T202055) (owner: Jcrespo)
[07:55:40] (PS2) Jcrespo: toolsdb: Ignore s51290__dpl_p replication on toolsdb replica [puppet] - https://gerrit.wikimedia.org/r/453355 (https://phabricator.wikimedia.org/T202055)
[07:55:52] (CR) Gehel: "One more minor comment."
(1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/451538 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[07:59:31] (CR) Volans: "reply inline" (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/451538 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[07:59:54] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/453943 (owner: Marostegui)
[08:01:05] Operations, Traffic, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp2005.codfw.wmnet', 'cp4032.ulsfo.wmnet'] ``` and were **ALL** successful.
[08:01:37] (Merged) jenkins-bot: db-eqiad.php: Depool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/453943 (owner: Marostegui)
[08:03:14] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db1100 (duration: 01m 02s)
[08:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:35] Operations, ops-codfw, DBA, Patch-For-Review: db2069 storage crash - https://phabricator.wikimedia.org/T201603 (Marostegui) I have compared the tables: `echo_target_page`, `echo_event` `echo_notification` across all wikis and no differences have been found.
So I believe we are good to go
[08:12:03] !log mobrovac@deploy1001 Started deploy [restbase/deploy@a3ae0d3] (dev-cluster): Remove contentmodel from MW API revision request - T201974
[08:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:10] T201974: Deprecation of API "action=query&prop=revisions&!rvslots" - https://phabricator.wikimedia.org/T201974
[08:13:05] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1100 (duration: 00m 50s)
[08:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:01] !log Deploy schema change on db1100 to check for regressions
[08:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:05] (CR) jenkins-bot: db-eqiad.php: Depool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/453943 (owner: Marostegui)
[08:16:10] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy
[08:16:19] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@a3ae0d3] (dev-cluster): Remove contentmodel from MW API revision request - T201974 (duration: 04m 16s)
[08:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:48] Operations: Support for QLogic FastLinQ 41112 Dual Port 10Gb SFP+ Adapter - https://phabricator.wikimedia.org/T202255 (MoritzMuehlenhoff)
[08:18:49] (PS1) Ema: ATS: allow to specify caching rules [puppet] - https://gerrit.wikimedia.org/r/453960 (https://phabricator.wikimedia.org/T199720)
[08:18:49] !log mobrovac@deploy1001 Started deploy [restbase/deploy@a3ae0d3] (dev-cluster): Remove contentmodel from MW API revision request
[08:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:51] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@a3ae0d3] (dev-cluster): Remove contentmodel from MW API revision request (duration: 06m 02s)
[08:24:55] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:13] !log mobrovac@deploy1001 Started deploy [restbase/deploy@a3ae0d3]: Remove contentmodel from MW API revision request - T201974
[08:25:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:19] T201974: Deprecation of API "action=query&prop=revisions&!rvslots" - https://phabricator.wikimedia.org/T201974
[08:25:36] (PS11) Jcrespo: db backup statistics: Initial implementation of the backup stats [puppet] - https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987)
[08:26:40] (PS2) Ema: ATS: add caching rules support [puppet] - https://gerrit.wikimedia.org/r/453960 (https://phabricator.wikimedia.org/T199720)
[08:27:21] (CR) Gehel: [C: 1] "comment inline, but minor enough, feel free to merge if you want to." (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/451538 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[08:31:26] !log installing systemd updates from stretch 9.5 point release
[08:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:20] PROBLEM - puppet last run on mw1294 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures.
Failed resources (up to 3 shown): Package[initramfs-tools]
[08:37:44] (CR) Marostegui: [C: 1] Revert "mariadb: Depool db2069 due to crash" [mediawiki-config] - https://gerrit.wikimedia.org/r/451844 (owner: Jcrespo)
[08:38:32] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1100" [mediawiki-config] - https://gerrit.wikimedia.org/r/453961
[08:40:41] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1100" [mediawiki-config] - https://gerrit.wikimedia.org/r/453961 (owner: Marostegui)
[08:41:53] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1100" [mediawiki-config] - https://gerrit.wikimedia.org/r/453961 (owner: Marostegui)
[08:42:53] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1100 (duration: 00m 50s)
[08:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:30] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@a3ae0d3]: Remove contentmodel from MW API revision request - T201974 (duration: 18m 17s)
[08:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:36] T201974: Deprecation of API "action=query&prop=revisions&!rvslots" - https://phabricator.wikimedia.org/T201974
[08:43:44] !log mobrovac@deploy1001 Started deploy [restbase/deploy@a3ae0d3]: Remove contentmodel from MW API revision request
[08:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:16] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1100" [mediawiki-config] - https://gerrit.wikimedia.org/r/453961 (owner: Marostegui)
[08:48:34] (CR) Volans: [C: 2] Add remote module to interact with Cumin (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/451538 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[08:49:29] (Merged) jenkins-bot: Add remote module to interact with Cumin [software/spicerack] - https://gerrit.wikimedia.org/r/451538
(https://phabricator.wikimedia.org/T199079) (owner: Volans)
[08:50:12] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@a3ae0d3]: Remove contentmodel from MW API revision request (duration: 06m 29s)
[08:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:41] (CR) Volans: [C: 2] config: directly inject global config path [software/spicerack] - https://gerrit.wikimedia.org/r/453371 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[08:57:35] (Merged) jenkins-bot: config: directly inject global config path [software/spicerack] - https://gerrit.wikimedia.org/r/453371 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[08:58:15] Operations, ops-codfw, DBA, Patch-For-Review: db2069 storage crash - https://phabricator.wikimedia.org/T201603 (jcrespo) Not sure exactly how you checked, but I saw one error in the first 20 wikis I checked: ``` echo angwikiquote | while read db; do echo "$db..."; ./compare.py wikidatawiki echo...
[08:58:47] (CR) Jcrespo: [C: -1] "See https://phabricator.wikimedia.org/T201603#4513954" [mediawiki-config] - https://gerrit.wikimedia.org/r/451844 (owner: Jcrespo)
[08:58:59] Operations, ops-codfw, DBA, Patch-For-Review: db2069 storage crash - https://phabricator.wikimedia.org/T201603 (Marostegui) I checked against db2033 only
[09:00:05] (CR) Volans: [C: 2] log: directly inject running user [software/spicerack] - https://gerrit.wikimedia.org/r/453372 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[09:01:09] (Merged) jenkins-bot: log: directly inject running user [software/spicerack] - https://gerrit.wikimedia.org/r/453372 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[09:03:14] Operations, TechCom-RFC, Traffic, Patch-For-Review, and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (mobrovac) >>!
In T201409#4500541, @Joe wrote: > We also need internal requests to be traced, so I would assume we need all servi...
[09:06:39] RECOVERY - puppet last run on mw1294 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[09:09:12] Operations, ops-codfw, DBA, Patch-For-Review: db2069 storage crash - https://phabricator.wikimedia.org/T201603 (jcrespo) Let's fix the issue above and let me continue a full check- at the moment there is no rush to repool it, we can reevaluate later. We may find more issues, even if not relevant...
[09:10:33] Operations, ops-codfw, DBA, Patch-For-Review: db2069 storage crash - https://phabricator.wikimedia.org/T201603 (Marostegui) Yeah, looks like there might be inconsistencies eqiad <-> codfw for all hosts :(
[09:29:32] Operations, Traffic, netops, Patch-For-Review: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459 (Marostegui)
[09:30:31] Operations, Cloud-VPS, cloud-services-team (Kanban): rack/setup/install labnodepool1002.eqiad.wmnet - https://phabricator.wikimedia.org/T168407 (hashar) Open>Resolved It is being reused as ` cloudservices1003.wikimedia.org` T201439
[09:31:10] !log uploaded nodejs 6.11.0~dfsg-1+wmf2+jessie to apt.wikimedia.org
[09:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:07] Operations, Traffic, Wikidata, wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531 (Addshore) >>! In T99531#4513194, @abian wrote: >>>! In T99531#4411395, @abian wrote: >> wikiba.se is a bit unstable. Today it has been dow...
[09:46:40] Operations, Wikimedia-Mailing-lists: Password reset request for wikimedia-nd mailing list - https://phabricator.wikimedia.org/T202247 (Geekdidi) Yes.
[10:03:25] (PS1) Muehlenhoff: Update Cumin aliases for wdqs [puppet] - https://gerrit.wikimedia.org/r/453984
[10:07:19] !log uploaded nodejs 6.11.0~dfsg-1+wmf2 to apt.wikimedia.org
[10:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:25] (CR) Muehlenhoff: [C: 2] Update Cumin aliases for wdqs [puppet] - https://gerrit.wikimedia.org/r/453984 (owner: Muehlenhoff)
[10:11:50] (PS3) Volans: Add service locator class Spicerack [software/spicerack] - https://gerrit.wikimedia.org/r/453373 (https://phabricator.wikimedia.org/T199079)
[10:11:52] (PS4) Volans: Add dnsdisc module to manipulate DNS Discovery [software/spicerack] - https://gerrit.wikimedia.org/r/451814 (https://phabricator.wikimedia.org/T199079)
[10:12:45] (CR) Volans: "Ready" [software/spicerack] - https://gerrit.wikimedia.org/r/453373 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[10:13:06] (CR) Volans: "Replies inline" (5 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/451814 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[10:23:41] !log rebooting mw2234 for some kernel tests
[10:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:27] (PS1) Arturo Borrero Gonzalez: cloudvps: eqiad1: move neutron db to m5-master [puppet] - https://gerrit.wikimedia.org/r/453987 (https://phabricator.wikimedia.org/T202261)
[10:30:04] jan_drewniak: Time to snap out of that daydream and deploy Wikimedia Portals Update. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180820T1030).
[10:31:55] !log rebooting mw1261 for kernel update/some tests
[10:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:47] Operations, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (aborrero) >>!
In T199125#4510282, @RobH wrote: >>>! In T199125#4509644, @aborrero wrote: >> Ok, >> >> @RobH let's assume we won't be using the 2x10G NI...
[10:46:09] Operations, Wikimedia-Mailing-lists: Password reset request for wikimedia-nd mailing list - https://phabricator.wikimedia.org/T202247 (Aklapper) @Geekdidi: What was the result then?
[10:46:40] (PS1) ArielGlenn: move huwiki, arwiki to 'bigwikis' for xml/sql dumps [puppet] - https://gerrit.wikimedia.org/r/453990 (https://phabricator.wikimedia.org/T202268)
[10:47:35] (CR) jerkins-bot: [V: -1] move huwiki, arwiki to 'bigwikis' for xml/sql dumps [puppet] - https://gerrit.wikimedia.org/r/453990 (https://phabricator.wikimedia.org/T202268) (owner: ArielGlenn)
[10:52:07] Operations, Traffic, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp4031.ulsfo.wmnet', 'cp2026.codfw.wmnet'] ``` The log can be found in `/var/l...
[10:55:25] (CR) Arturo Borrero Gonzalez: [C: 2] "Compiler is happy https://puppet-compiler.wmflabs.org/compiler02/12135/" [puppet] - https://gerrit.wikimedia.org/r/453987 (https://phabricator.wikimedia.org/T202261) (owner: Arturo Borrero Gonzalez)
[10:55:31] Operations, Wikimedia-Mailing-lists: Password reset request for wikimedia-nd mailing list - https://phabricator.wikimedia.org/T202247 (Geekdidi) It's kinda funny, because, both anthony.mcgreat@gmail.com and wpportharcourt@gmail.com are both controlled by me. The former is my personal email, while the la...
[11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy European Mid-day SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180820T1100).
[11:00:04] Tulsi, tgr, and Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:13] o/
[11:00:16] I can SWAT today
[11:00:30] tgr|away: want to deploy your change yourself?
[11:00:55] zeljkof: sure, can do
[11:01:10] tgr: go ahead then, while I review other patches
[11:01:52] !log T202261 disabled puppet in cloudcontrol1003.wikimedia.org, cloudcontrol1004.wikimedia.org, clounet1003.eqiad.wmnet, cloudnet1004.eqiad.wmnet
[11:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:00] (CR) Gergő Tisza: [C: 2] Allow all bureaucrats to remove interface-admin [mediawiki-config] - https://gerrit.wikimedia.org/r/450450 (owner: Gergő Tisza)
[11:02:00] T202261: cloudvps: eqiad1: move neutron db to m5-master - https://phabricator.wikimedia.org/T202261
[11:02:31] Please also instruct me ! I am new to deployment process. https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/451823/
[11:03:17] (Merged) jenkins-bot: Allow all bureaucrats to remove interface-admin [mediawiki-config] - https://gerrit.wikimedia.org/r/450450 (owner: Gergő Tisza)
[11:03:59] hi Tulsi, you're next, after tgr is done
[11:04:06] Ok zeljkof
[11:04:32] I don't know, what to do?
[11:04:42] Tulsi: I'll deploy the commit to mwdebug1002, you should test there before I deploy it, do you need help testing there?
[11:04:46] :(
[11:05:00] Yes
[11:05:21] (PS2) ArielGlenn: move huwiki, arwiki to 'bigwikis' for xml/sql dumps [puppet] - https://gerrit.wikimedia.org/r/453990 (https://phabricator.wikimedia.org/T202268)
[11:05:26] Tulsi: ok, it's not complicated, there is documentation https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#Staging_changes
[11:05:54] Tulsi: the docs may look too complicated, but you have to install the extension, enable it and select mwdebug1002 in the extension options
[11:06:08] then you go to any wmf wiki and test there, that's all
[11:06:13] (CR) jerkins-bot: [V: -1] move huwiki, arwiki to 'bigwikis' for xml/sql dumps [puppet] - https://gerrit.wikimedia.org/r/453990 (https://phabricator.wikimedia.org/T202268) (owner: ArielGlenn)
[11:06:27] Tulsi: when done, just disable the extension, and let me know if you have any questions
[11:07:04] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:450450|Allow all bureaucrats to remove interface-admin]] (duration: 00m 54s)
[11:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:11] zeljkof: done
[11:07:19] (PS1) Volans: Add a retry decorator [software/spicerack] - https://gerrit.wikimedia.org/r/453994 (https://phabricator.wikimedia.org/T199079)
[11:07:27] tgr: great! I'll take over swat then
[11:08:02] Tulsi: I'll review and merge your change now, I'll let you know when it's at mwdebug1002, I can test there if you have trouble testing there, but you'll have to let me know how to test :)
[11:08:20] I mean, what to do to test if the change works fine
[11:08:51] Okay.
I'm here thanks for helping out
[11:09:09] (PS3) ArielGlenn: move huwiki, arwiki to 'bigwikis' for xml/sql dumps [puppet] - https://gerrit.wikimedia.org/r/453990 (https://phabricator.wikimedia.org/T202268)
[11:09:28] (PS4) Zfilipin: Enable Rollbacker User Group at ru.wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/451823 (https://phabricator.wikimedia.org/T200201) (owner: Tulsi Bhagat)
[11:09:44] This is my first time. I'm having trouble testing. :(
[11:11:00] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/451823 (https://phabricator.wikimedia.org/T200201) (owner: Tulsi Bhagat)
[11:11:09] Tulsi: did you install the extension?
[11:12:10] Tulsi: the links are at above "contents" box https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug
[11:12:18] (Merged) jenkins-bot: Enable Rollbacker User Group at ru.wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/451823 (https://phabricator.wikimedia.org/T200201) (owner: Tulsi Bhagat)
[11:13:43] Tulsi: your commit has been deployed to mwdebug1002, but only there so far, until we confirm it works fine
[11:14:10] (PS1) Muehlenhoff: Enable intel-microcode for all bare metal servers with an Intel CPU [puppet] - https://gerrit.wikimedia.org/r/453997 (https://phabricator.wikimedia.org/T127825)
[11:15:19] kart_: I guess you are the one to ask, just noticed this in fatalmonitor: `500 Unknown modifier 'R': [/^page\-User\:BeneBot.+/RfD\-open/text$/] in /srv/mediawiki/php-1.32.0-wmf.16/extensions/Translate/stringmangler/StringMatcher.php on line 100`
[11:15:48] Sorry, It's installing.
[11:16:23] Tulsi: ok, let me know when you're ready
[11:16:34] Urbanecm: around for swat?
[11:16:47] zeljkof: There's already a bug for that filed
[11:16:48] (CR) jenkins-bot: Allow all bureaucrats to remove interface-admin [mediawiki-config] - https://gerrit.wikimedia.org/r/450450 (owner: Gergő Tisza)
[11:16:50] (CR) jenkins-bot: Enable Rollbacker User Group at ru.wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/451823 (https://phabricator.wikimedia.org/T200201) (owner: Tulsi Bhagat)
[11:16:57] Reedy: thanks!
[11:17:53] (PS1) Muehlenhoff: Update obsolete Cumin aliases [puppet] - https://gerrit.wikimedia.org/r/453998
[11:21:13] Tulsi: still installing? if you want to learn how to test, I'll wait, but if you think this is the only time you're doing a deployment, I can test for you
[11:21:44] Already installed
[11:22:20] Extension is on
[11:22:31] how to test?
[11:22:53] now just go to a wiki, probably https://ru.wikiquote.org and test if the commit works
[11:23:00] (CR) Ema: [C: 1] "Just a couple of typos in the commit message, looks good to me and pcc otherwise: https://puppet-compiler.wmflabs.org/compiler02/12136/" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/453997 (https://phabricator.wikimedia.org/T127825) (owner: Muehlenhoff)
[11:23:08] when done, disable (off) the extension
[11:23:20] oh ok doing
[11:24:16] Did it break anything in real on-wiki?
[11:24:30] Operations, Traffic, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp2026.codfw.wmnet', 'cp4031.ulsfo.wmnet'] ``` and were **ALL** successful.
[11:24:44] Tulsi: I don't know, does something look broken?
[11:24:54] (CR) Volans: "See inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/453998 (owner: Muehlenhoff)
[11:25:12] !log T202261 icinga downtime 1h for cloudcontrol1003.wikimedia.org, cloudcontrol1004.wikimedia.org, clounet1003.eqiad.wmnet, cloudnet1004.eqiad.wmnet previous to patch merge
[11:25:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:21] T202261: cloudvps: eqiad1: move neutron db to m5-master - https://phabricator.wikimedia.org/T202261
[11:26:31] (PS2) Muehlenhoff: Enable intel-microcode for all bare metal servers with an Intel CPU [puppet] - https://gerrit.wikimedia.org/r/453997 (https://phabricator.wikimedia.org/T127825)
[11:27:53] Tulsi: do you need help? can I deploy the change? should I revert it?
[11:28:25] zeljkof https://ru.wikiquote.org/wiki/%D0%A3%D1%87%D0%B0%D1%81%D1%82%D0%BD%D0%B8%D0%BA:Tulsi_Bhagat/Sandbox
[11:28:51] I have created sandbox. Still not getting how to test
[11:29:04] The extension is on
[11:29:52] Tulsi: well, what does your commit in gerrit do? it adds a feature to a group of users, right?
[11:30:39] Tulsi: can you check if the feature is enabled for that group? there should be a special page listing features for groups, I think
[11:30:42] (CR) Ema: Update obsolete Cumin aliases (1 comment) [puppet] - https://gerrit.wikimedia.org/r/453998 (owner: Muehlenhoff)
[11:30:51] Ok checking out
[11:32:28] Yes, the usergroup is on
[11:32:56] Tulsi: ok, so does the feature work? can I deploy it?
[11:34:06] yes, deploy it please
[11:34:16] (CR) Muehlenhoff: [C: 2] Update obsolete Cumin aliases [puppet] - https://gerrit.wikimedia.org/r/453998 (owner: Muehlenhoff)
[11:35:13] Tulsi: ok, deploying
[11:36:14] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:451823|Enable Rollbacker User Group at ru.wikiquote (T200201)]] (duration: 00m 53s)
[11:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:20] T200201: Rollbacker in Russian Wikiquote - https://phabricator.wikimedia.org/T200201
[11:36:39] Tulsi: it's deployed, disable the extension and check that the feature works
[11:37:09] Urbanecm: around for swat?
[11:37:59] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/453692 (https://phabricator.wikimedia.org/T202228) (owner: Urbanecm)
[11:38:07] Okay. Thanks a lot for helping out & So Sorry, I'm new to this. I just wanted to help out on Phab & gerrit.
[11:38:41] Tulsi: no problem at all, I'm here to help! :D I'm glad we were able to deploy it, the fist deploy is always hard
[11:38:53] Tulsi: just make sure the feature works now that it's fully deployed
[11:39:16] (Merged) jenkins-bot: Upload HD logos for lfnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/453692 (https://phabricator.wikimedia.org/T202228) (owner: Urbanecm)
[11:39:17] (to test production, disable the extension)
[11:39:23] Okay
[11:40:25] Tulsi: let me know if you have any questions and see you for another SWAT deploy :)
[11:41:03] Operations, Wikimedia-Mailing-lists: Password reset request for wikimedia-nd mailing list - https://phabricator.wikimedia.org/T202247 (Aklapper) Ah, heh, alright. Didn't know that, sorry.
:)
[11:41:50] !log zfilipin@deploy1001 Synchronized static/images/project-logos/: SWAT: [[gerrit:453692|Upload HD logos for lfnwiki (T202228)]] (duration: 00m 50s)
[11:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:57] T202228: Report that lfnwiki logo is seen blurry by some contributors - https://phabricator.wikimedia.org/T202228
[11:43:17] zeljkof: Yeah sure ! Thanks.. Nice meeting you... :)
[11:43:40] Tulsi: nice meeting you too! :)
[11:44:02] (PS2) Zfilipin: Use HD logos for lfnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/453693 (https://phabricator.wikimedia.org/T202228) (owner: Urbanecm)
[11:47:12] (CR) jenkins-bot: Upload HD logos for lfnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/453692 (https://phabricator.wikimedia.org/T202228) (owner: Urbanecm)
[11:47:21] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/453693 (https://phabricator.wikimedia.org/T202228) (owner: Urbanecm)
[11:48:37] (Merged) jenkins-bot: Use HD logos for lfnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/453693 (https://phabricator.wikimedia.org/T202228) (owner: Urbanecm)
[11:50:07] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:453693|Use HD logos for lfnwiki (T202228)]] (duration: 00m 50s)
[11:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:50:14] T202228: Report that lfnwiki logo is seen blurry by some contributors - https://phabricator.wikimedia.org/T202228
[11:51:09] Operations, ops-codfw, DBA, Patch-For-Review: db2069 storage crash - https://phabricator.wikimedia.org/T201603 (jcrespo) Strange, maybe there is a bug or a race condition? ``` angwikisource... 2018-08-20T09:27:39.462984: row id 269950842/273790891, ETA: 00m22s, 0 chunk(s) found different DIFFEREN...
[11:52:39] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2069 storage crash - https://phabricator.wikimedia.org/T201603 (10Marostegui) Maybe it was caught in the middle of a transaction or something? [11:52:42] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453383 (https://phabricator.wikimedia.org/T202127) (owner: 10Wangql) [11:53:55] (03Merged) 10jenkins-bot: Adding Chinese Wikiversity's logos: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453383 (https://phabricator.wikimedia.org/T202127) (owner: 10Wangql) [11:55:15] (03Abandoned) 10Gergő Tisza: Allow bureaucrats to remove 'interface-admin' right in plwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453179 (https://phabricator.wikimedia.org/T202085) (owner: 10Ankry) [11:55:33] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2069 storage crash - https://phabricator.wikimedia.org/T201603 (10jcrespo) It could be the missing auto-commit, only taking an effect over WAN: https://gerrit.wikimedia.org/r/#/c/operations/software/wmfmariadbpy/+/449185/ [11:56:21] !log zfilipin@deploy1001 Synchronized static/images/project-logos/: SWAT: [[gerrit:453383|Adding Chinese Wikiversitys logos (T202127)]] (duration: 00m 50s) [11:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:27] T202127: Change the default logo of Chinese Wikiversity - https://phabricator.wikimedia.org/T202127 [11:59:06] (03CR) 10Zfilipin: "Purged: T202127#4514403" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453383 (https://phabricator.wikimedia.org/T202127) (owner: 10Wangql) [12:01:25] (03CR) 10Zfilipin: "reckeck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453696 (https://phabricator.wikimedia.org/T202139) (owner: 10Urbanecm) [12:01:31] !log T202261 extend icinga downtime 1D for cloudcontrol1003.wikimedia.org, cloudcontrol1004.wikimedia.org, clounet1003.eqiad.wmnet, cloudnet1004.eqiad.wmnet neutron not properly syncing 
with agents [12:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:37] T202261: cloudvps: eqiad1: move neutron db to m5-master - https://phabricator.wikimedia.org/T202261 [12:02:36] !log EU SWAT finished [12:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:42] (03CR) 10jenkins-bot: Use HD logos for lfnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453693 (https://phabricator.wikimedia.org/T202228) (owner: 10Urbanecm) [12:02:44] (03CR) 10jenkins-bot: Adding Chinese Wikiversity's logos: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453383 (https://phabricator.wikimedia.org/T202127) (owner: 10Wangql) [12:03:33] zeljkof: sorry, missed the message. [12:03:40] Is Nikerabbit aware? [12:04:13] kart_: no problem, Reedy said there's a phab task for it, I did not check, was busy with swat [12:04:23] OK. [12:08:19] yes I am aware and there is a task for it [12:08:39] Thanks Nikerabbit [12:13:22] 10Operations, 10Maps, 10Maps-Sprint, 10Reading-Infrastructure-Team-Backlog: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (10Gehel) [12:16:28] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 34 probes of 317 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [12:17:45] (03CR) 10Marostegui: [C: 031] "I am happy with how this looks like and all the suggestions by Filippo have been implemented already." [puppet] - 10https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo) [12:18:58] PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[12:19:18] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed [12:20:04] (03PS12) 10Jcrespo: db backup statistics: Initial implementation of the backup stats [puppet] - 10https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987) [12:20:58] (03CR) 10Jcrespo: [C: 032] db backup statistics: Initial implementation of the backup stats [puppet] - 10https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo) [12:21:04] (03CR) 10Gehel: [C: 031] "LGTM" (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/451814 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [12:21:28] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 16 probes of 317 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [12:21:43] wikitech is down for me [12:21:48] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest [12:21:59] Amir1: yeah, same here [12:22:02] checking [12:22:22] the DB is up [12:22:55] maybe arturo did something- he was asking about m5 earlier [12:24:19] Thanks! [12:25:21] not something as in bad, but some maintenance or something in progress [12:25:38] This is arturo's change: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/453987/ [12:25:58] the db is also m5? [12:26:03] yes [12:26:04] I added a new GRANT [12:26:04] same db [12:26:20] arturo: which one? [12:26:22] wikitech is up for me [12:26:27] marostegui: https://phabricator.wikimedia.org/T202261#4514328 [12:26:30] Yeah, it is up now [12:26:32] oh, no, just the cache [12:26:40] please, let's check the DB configuration to see if I did something wrong [12:26:47] there are weird things [12:26:51] arturo: it is not that easy [12:26:58] there are dozens of accounts [12:27:12] wikitech is up for me [12:27:34] was something done?
[12:27:35] that is why we are so encouraging about documentation- but again, we don't know yet this is related [12:27:53] ok, keep me posted in case I can be of any help [12:27:53] it could be something else, just pinged in case you knew some maintenance was being done or something [12:28:03] well, wikitech is kind of your realm [12:28:14] some more like you keep us posted :-) [12:28:18] yeah ^^U [12:28:29] maybe a network issue? [12:28:33] could be [12:29:02] check https://grafana.wikimedia.org/dashboard/db/mysql?panelId=5&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1073&var-port=9104&from=1534766850576&to=1534768123590 [12:29:28] Did anyone do something? Or it got self fixed? [12:29:28] monitoring failed for some amount of time, and we know that happened before [12:29:42] !log upgrading scb2006 to latest nodejs along with rolling restarts of node-based services [12:29:44] and it is unlikely to be affected by grant changes [12:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:36] there are a lot of puppet errors, too [12:30:51] so maybe some network issue, could be nova involved? [12:31:00] either as a cause or a consequence [12:31:19] but a cloud network issue wouldn't affect production monitoring [12:31:24] db1073 is on b3, that row isn't affected by the recent network issues no? [12:31:25] so maybe production network issue? [12:31:27] I cannot remember which row was it [12:31:33] who knows at this point [12:31:58] toolforge nodes are seeing 500 on the puppetmaster, briefly, so perhaps a networking issue? [12:32:38] Link is down / Link is up but 31 July [12:33:31] There were/are too many connections errors on db1073 [12:33:45] yeah, but those could be the grants arturo was talking about [12:33:48] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational [12:33:59] it wouldn't cause a connectivity issue? [12:34:00] db1073 is m5?
[12:34:00] They are for now [12:34:02] https://logstash.wikimedia.org/goto/a8ac6d883398498aa626944ec43f741a [12:34:04] arturo: yes [12:34:08] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active [12:34:39] I had an issue with neutron agents, I had to restart them many times (possibly causing many connections to m5) [12:34:45] [2018-08-20 12:18:47] SERVICE ALERT: dbproxy1005;haproxy failover;CRITICAL;SOFT;1;CRITICAL check_failover servers up 1 down 1 [12:35:18] was that sent to this channel? [12:35:27] so so far we seem to have a mysql connection issue [12:35:31] maybe an overload [12:35:54] we can reload the proxy- it is not active [12:36:06] I guess it comes from the too many connections [12:36:17] jynus: +1 to reload [12:36:19] threads connected = 500 [12:36:26] which probably is the maximum [12:36:36] so arturo your change may be making mysql connections fail [12:36:40] yeah, as I said, logstash is showing too many connections errors [12:36:53] ACK [12:37:00] I can rollback the change if required [12:37:12] so I don't think there is a network issue, but your change makes nova, keystone, wikitech fail [12:37:12] (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453696 (https://phabricator.wikimedia.org/T202139) (owner: 10Urbanecm) [12:37:13] what's the status with those agent restarts you mentioned?
[12:37:17] and all m5 services [12:38:08] I can kill all current users, but if they reconnect, we will gain nothing [12:38:09] marostegui: there seems to be some rabbitmq issue (internal to the servers) which makes them try to resync every time with the master neutron server which, I guess, produces many queries to the DB [12:38:21] arturo: most of the connections are coming from keystone and nova user [12:38:35] I can try relaxing heartbeats between the agents and the server [12:38:42] marostegui: I would expect the neutron user [12:39:05] arturo: neutron has 62, whereas nova has 148 and keystone 211 [12:39:05] anyway keystone is used almost for everything in openstack (every time auth is used, which is many times) [12:39:24] I don't know what they do or if it is normal, just saying the figures I am seeing [12:39:29] (03CR) 10Alex Monk: [C: 04-1] "Manually rebased beta cherry-pick over I7b7812995d67a6567753fdee63cb5d611e9f07e7 and I6270ac140c57a41713947c4b6f937bb7cc344a95" [puppet] - 10https://gerrit.wikimedia.org/r/452689 (https://phabricator.wikimedia.org/T158837) (owner: 10Krinkle) [12:39:35] also, worth noting that keystone is used by the old nova-network openstack deployment (i.e, the one in prod) [12:40:15] I can simply shutdown neutron right now and see if the server relaxes [12:40:42] I've done a clean up [12:40:47] 134 clients connected now [12:40:55] but it may likely go up [12:41:04] 222 [12:41:14] arturo: maybe this can help https://phabricator.wikimedia.org/T188589 [12:41:17] 234 [12:41:39] arturo: you need to limit the per-user connections [12:41:43] on the connection pools [12:41:58] 328 connections [12:42:21] you cannot just have hundreds of connections per user [12:42:32] unless you have dedicated mysql instances [12:42:48] which user? 
[12:42:55] that causes not only service degradation, but causes all other services to go down [12:42:57] arturo: all [12:43:03] keystone [12:43:06] designate [12:43:07] nova [12:43:11] From what I can see nova and keystone have almost 300 connections summing both of them [12:43:14] neutron has 16 [12:43:15] glance [12:43:19] neutron [12:43:36] I am guessing all of those are "openstack" accounts using conection pooling [12:43:59] I can share a processlist with you in private, arturo [12:44:03] if that helps [12:44:44] 10Operations, 10Cloud-Services, 10DBA, 10Patch-For-Review: db1009 overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589 (10jcrespo) 05Resolved>03Open [12:45:08] do you have metrics of that numbers? are they only today? [12:45:20] are the new high numbers related to my today's change? [12:45:31] (03CR) 10Gehel: "Minor comments inline, otherwise LGTM" (035 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/453994 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [12:46:31] arturo: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=37&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1073&var-port=9104&from=1534721915831&to=1534769138577 [12:47:17] 10Operations, 10Cloud-Services, 10DBA, 10Patch-For-Review: db1009 overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589 (10jcrespo) ``` MariaDB [(none)]> select user, host, count(*) FROM information_schema.processlist GROUP BY USER, HOST; +-----------------+-------... [12:47:37] the sudden drop at the end is jynus's restart? 
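A note on the per-user figures quoted above (neutron, nova, keystone): they come from grouping the processlist by user, as in the `GROUP BY USER, HOST` query pasted on T188589. As a minimal sketch of the same tally done client-side, assuming the rows have already been fetched as `(user, host)` tuples, something like the following would flag the accounts holding the most connections. The function names and the sample rows are hypothetical and are not data from db1073.

```python
from collections import Counter


def connections_per_user(rows):
    """Count processlist entries per user.

    rows: iterable of (user, host) tuples, one per open connection.
    """
    return Counter(user for user, _host in rows)


def over_limit(counts, limit):
    """Return (user, count) pairs above `limit`, busiest first."""
    return [(user, n) for user, n in counts.most_common() if n > limit]


# Illustrative rows only; these are NOT real figures from the incident.
sample = (
    [("keystone", "host-a")] * 5
    + [("nova", "host-a")] * 3
    + [("neutron", "host-b")] * 1
)

counts = connections_per_user(sample)
print(over_limit(counts, limit=2))
```

With a per-user budget in hand, the same tally can be rerun after each mitigation to see which account is refilling the connection table.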
[12:47:48] I kill all sleeping processes [12:48:00] but it is only a matter of time until they come back, I would guess [12:48:08] 10Operations, 10Cloud-Services, 10DBA, 10Patch-For-Review: db1009 overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589 (10Marostegui) p:05Normal>03High [12:48:40] 10Operations, 10Cloud-Services, 10DBA, 10Patch-For-Review: db1009 overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589 (10jcrespo) From manuel https://grafana.wikimedia.org/dashboard/db/mysql?panelId=37&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server... [12:48:52] 10Operations, 10Cloud-Services, 10DBA, 10Patch-For-Review: m5-master overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589 (10jcrespo) [12:49:00] arturo: https://phabricator.wikimedia.org/T188589 [12:49:14] In particular https://phabricator.wikimedia.org/T188589#4514545 [12:49:44] jynus: please edit https://phabricator.wikimedia.org/T188589#4514545 and add ```lines=10 or something [12:50:42] it seems nova and keystone have a large number of connections- we can put a hard limit to that, but that would make those fail [12:50:50] but at least it wouldn't make wikitech fail [12:51:42] !log reload dbproxy1005 [12:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:05] I would like to discuss with my team further changes (regarding limits, etc) [12:54:13] 10Operations, 10Cloud-Services, 10DBA, 10Patch-For-Review: m5-master overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589 (10jcrespo) Causing wikitech access errors, among others: https://logstash.wikimedia.org/goto/4d71579b957ae7e197c04882fa9dcd7c [12:54:29] arturo: but we need a solution for now, it is causing wikitech to fail [12:55:01] That would explain more of this....reading back. The logs I've been reading are quite odd.
[12:55:46] We can temporarily increase the max connections to e.g. 1000, but the reason it is so low is probably to not make the problem worse [12:56:04] so it need an app-layer limit [12:56:17] e.g. limit the number of connection pool per user more [12:56:31] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting Access to view EventLogging data for gabriel-wmde / gbirke - https://phabricator.wikimedia.org/T202072 (10Ottomata) To access EventLogging data in MySQL, you should be in the `researchers` group, and access it from stat1006. To access EventLog... [12:56:37] feel free to suggest on ticket, we will help [12:57:18] we can enforce per-user limits, but that should only be as a last resort, a hard limit [12:57:24] (at db layer) [12:58:08] the limiting patch for nova was reverted bc `This broke VM creation` [12:59:17] so I would ask to raise DB limits [12:59:23] to avoid contention for now [12:59:39] meanwhile we work in a long-term solution from the openstack POV [12:59:41] I will raise it to 800 for now [12:59:44] And see how it goes [13:00:12] thanks [13:00:31] !log Increase max_connections from 500 to 800 on db1073 to triage issues - T188589 [13:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:38] T188589: m5-master overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589 [13:01:18] is it possible to have a DB just for us? [13:01:37] (just for openstack, I mean) [13:01:42] 10Operations, 10Cloud-Services, 10DBA, 10Patch-For-Review: m5-master overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589 (10aborrero) I asked the DBA team to raise limits for now to avoid contention. We should work on a long term solution to avoid saturating the... 
[13:01:50] 10Operations, 10Cloud-Services, 10DBA, 10Patch-For-Review: m5-master overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589 (10Marostegui) For now I have done: ``` root@db1073.eqiad.wmnet[(none)]> show global variables like 'max_connections'; +-----------------+---... [13:03:14] arturo: If you mean your own instance (a mysql instance running on the same host on a different port than 3306) we could work that out, but your applications should be able to connect to an specific port (and not assume mysql will be on 3306). If you mean your own servers, that requires hardware purchasement [13:04:09] fair enough [13:04:26] will think on this while on lunch [13:04:28] bbl [13:06:17] I see everything recovering with the temp change, so this is what was breaking puppet. I'll take a look as well in case I can be of some help. [13:07:39] bstorm_: yeah, this was affecting all the databases in m5 master - the temporary change should be reverted and whatever was causing the issue must be fixed [13:07:54] 👍🏻 [13:08:59] I hope you understand why not just increasing it arbitrarily, as it may just delay or make the problem worse in the long term [13:09:14] it is ok for now, though, if it doesn't keep increasing quickly [13:11:05] (03CR) 10Volans: "Replies inline, I'll work on the changes while we found an agreement on the remaining open question" (035 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/453994 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [13:14:08] Oh I get that. 
I haven't done much with the openstack directly, but I'm going to look in case fresh eyes see something helpful [13:30:21] (03PS1) 10Ottomata: Release 0.1.3 with both upstream and wikimedia improvements [debs/git-fat] (debian) - 10https://gerrit.wikimedia.org/r/454017 (https://phabricator.wikimedia.org/T202100) [13:30:47] (03CR) 10Ottomata: [V: 032 C: 032] Release 0.1.3 with both upstream and wikimedia improvements [debs/git-fat] (debian) - 10https://gerrit.wikimedia.org/r/454017 (https://phabricator.wikimedia.org/T202100) (owner: 10Ottomata) [13:33:13] PROBLEM - Check systemd state on elastic2029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:34:02] PROBLEM - Check systemd state on elastic2018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:34:22] (03PS1) 10WMDE-Fisch: Enable moved paragrah detection everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454019 (https://phabricator.wikimedia.org/T199800) [13:36:44] ^^ checking elastic, I think it is still the same issue [13:43:43] yep, this is still the same, I'm on it [13:44:33] PROBLEM - Check systemd state on elastic2016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[13:46:27] (03PS2) 10Gehel: Mjolnir daemons should run with Restart=always [puppet] - 10https://gerrit.wikimedia.org/r/453450 (https://phabricator.wikimedia.org/T202120) (owner: 10EBernhardson) [13:46:51] (03PS1) 10Bstorm: keystone: limit the size of the connection pool so it reuses connections [puppet] - 10https://gerrit.wikimedia.org/r/454020 (https://phabricator.wikimedia.org/T188589) [13:46:51] !log installing jetty9 security updates [13:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:20] (03CR) 10Gehel: [C: 032] Mjolnir daemons should run with Restart=always [puppet] - 10https://gerrit.wikimedia.org/r/453450 (https://phabricator.wikimedia.org/T202120) (owner: 10EBernhardson) [13:47:35] (03CR) 10jenkins-bot: [V: 04-1] keystone: limit the size of the connection pool so it reuses connections [puppet] - 10https://gerrit.wikimedia.org/r/454020 (https://phabricator.wikimedia.org/T188589) (owner: 10Bstorm) [13:48:24] (03CR) 10Bstorm: "Let me just preface this with: this is my first openstack patch. That said, I see no reason sqlalchemy cannot make do with a pool of 20 c" [puppet] - 10https://gerrit.wikimedia.org/r/454020 (https://phabricator.wikimedia.org/T188589) (owner: 10Bstorm) [13:48:42] RECOVERY - Check systemd state on elastic2029 is OK: OK - running: The system is fully operational [13:49:24] (03CR) 10Marostegui: "Reminder: once this is merged, let's set max_connections back to 500" [puppet] - 10https://gerrit.wikimedia.org/r/454020 (https://phabricator.wikimedia.org/T188589) (owner: 10Bstorm) [13:50:43] RECOVERY - Check systemd state on elastic2016 is OK: OK - running: The system is fully operational [13:51:19] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: mjolnir-kafka-bulk-daemon failed on all elastic / eqiad nodes - https://phabricator.wikimedia.org/T202120 (10Gehel) Restart=always on the systemd unit should fix the immediate issue. This has been deployed. I'm keeping this task open for a...
[13:51:46] (03PS2) 10Bstorm: keystone: limit the size of the connection pool so it reuses connections [puppet] - 10https://gerrit.wikimedia.org/r/454020 (https://phabricator.wikimedia.org/T188589) [13:55:10] (03CR) 10Bstorm: "Nova might need some love as well with regard to connection pooling, but it's probably best to see how this goes first (presuming Arturo o" [puppet] - 10https://gerrit.wikimedia.org/r/454020 (https://phabricator.wikimedia.org/T188589) (owner: 10Bstorm) [13:58:53] RECOVERY - Check systemd state on elastic2018 is OK: OK - running: The system is fully operational [13:59:32] 10Operations, 10Scap, 10Patch-For-Review: Intermittent git-fat failure during deploy - https://phabricator.wikimedia.org/T202100 (10Ottomata) Don't totally remember the context here, but I just built a new version of git-fat with your fix, and installed on deploy1001 `Unpacking git-fat (0.1.3-1~stretch1) ov... [14:11:14] (03PS2) 10WMDE-Fisch: Enable moved paragrah detection everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454019 (https://phabricator.wikimedia.org/T199800) [14:13:24] (03CR) 10Andrew Bogott: [C: 04-1] "I'm all in favor of setting the connection pool size. A few comments inline about our weird hierarchy-that-misuses-hiera." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/454020 (https://phabricator.wikimedia.org/T188589) (owner: 10Bstorm) [14:14:00] !log rolling upgrade of scb in codfw to latest nodejs security update [14:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:09] (03CR) 10Bstorm: "Cool! So I can just cut it out of those modules (eqiad1 and labtestn)? 
I think I'm getting how base sets defaults now that I'm reading m" [puppet] - 10https://gerrit.wikimedia.org/r/454020 (https://phabricator.wikimedia.org/T188589) (owner: 10Bstorm) [14:20:56] (03PS1) 10WMDE-Fisch: Enable moved paragrah detection everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454026 (https://phabricator.wikimedia.org/T199800) [14:23:15] (03PS3) 10WMDE-Fisch: Cleanup wikdiff2 mobile moved paragraph config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454019 (https://phabricator.wikimedia.org/T199800) [14:27:04] (03PS4) 10WMDE-Fisch: Cleanup wikdiff2 mobile moved paragraph config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454019 (https://phabricator.wikimedia.org/T199800) [14:30:23] (03PS3) 10Bstorm: keystone: limit the size of the connection pool so it reuses connections [puppet] - 10https://gerrit.wikimedia.org/r/454020 (https://phabricator.wikimedia.org/T188589) [14:32:05] (03CR) 10Bstorm: keystone: limit the size of the connection pool so it reuses connections (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/454020 (https://phabricator.wikimedia.org/T188589) (owner: 10Bstorm) [14:35:29] (03CR) 10Imarlier: PHP: create module for modern Debian-based distributions (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/452664 (https://phabricator.wikimedia.org/T201140) (owner: 10Giuseppe Lavagetto) [14:36:14] (03PS1) 10Andrew Bogott: Keystone: Include ldap config on eqiad1 keystone hosts [puppet] - 10https://gerrit.wikimedia.org/r/454031 [14:38:13] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet operation_type={create_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:39:23] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:40:22] (03CR) 10Muehlenhoff: [C: 031] PHP: create module for modern Debian-based distributions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/452664 (https://phabricator.wikimedia.org/T201140) (owner: 10Giuseppe Lavagetto) [14:49:20] (03CR) 10Imarlier: [C: 031] Backend-Timing Varnish mtail program [puppet] - 10https://gerrit.wikimedia.org/r/434879 (https://phabricator.wikimedia.org/T131894) (owner: 10Gilles) [14:50:44] (03PS2) 10Andrew Bogott: Keystone: Include ldap config on eqiad1 keystone hosts [puppet] - 10https://gerrit.wikimedia.org/r/454031 (https://phabricator.wikimedia.org/T202291) [14:52:34] (03CR) 10Andrew Bogott: [C: 032] Keystone: Include ldap config on eqiad1 keystone hosts [puppet] - 10https://gerrit.wikimedia.org/r/454031 (https://phabricator.wikimedia.org/T202291) (owner: 10Andrew Bogott) [15:00:06] 10Operations, 10ops-codfw: Check/replace PEM2 on cr2-codfw - https://phabricator.wikimedia.org/T202166 (10Papaul) @ayounsi All lights on all PSU's on cr2 are green. What do you want me to do here? [15:00:11] (03PS4) 10Bstorm: keystone: limit the size of the connection pool so it reuses connections [puppet] - 10https://gerrit.wikimedia.org/r/454020 (https://phabricator.wikimedia.org/T188589) [15:01:11] 10Operations, 10TechCom-RFC, 10Traffic, 10Patch-For-Review, and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10Imarlier) >>! In T201409#4513970, @mobrovac wrote: > > If a service receives a request without a req id it means we have a hole... 
[15:03:32] 10Operations, 10DBA, 10monitoring: HAproxy on dbproxy hosts lack enough logging - https://phabricator.wikimedia.org/T201021 (10Marostegui) The timeout increase was done at: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/450542/ [15:06:05] (03CR) 10Andrew Bogott: [C: 031] keystone: limit the size of the connection pool so it reuses connections [puppet] - 10https://gerrit.wikimedia.org/r/454020 (https://phabricator.wikimedia.org/T188589) (owner: 10Bstorm) [15:09:41] (03CR) 10Arturo Borrero Gonzalez: [C: 031] keystone: limit the size of the connection pool so it reuses connections [puppet] - 10https://gerrit.wikimedia.org/r/454020 (https://phabricator.wikimedia.org/T188589) (owner: 10Bstorm) [15:09:46] (03CR) 10Bstorm: [C: 032] keystone: limit the size of the connection pool so it reuses connections [puppet] - 10https://gerrit.wikimedia.org/r/454020 (https://phabricator.wikimedia.org/T188589) (owner: 10Bstorm) [15:09:53] (03PS5) 10Bstorm: keystone: limit the size of the connection pool so it reuses connections [puppet] - 10https://gerrit.wikimedia.org/r/454020 (https://phabricator.wikimedia.org/T188589) [15:10:49] marostegui: jynus: we are merging a patch to establish some limits in the connection pooling by keystone [15:10:53] cc bstorm_ [15:11:02] let's see what happens in the next 30 mins? [15:11:06] arturo: sure, once merged, let me know so I can go back from 800 connections to 500 [15:11:23] I can also kill idle connections if needed [15:11:30] marostegui: let's wait to lower the limit until we see an actual drop in the usage? [15:11:44] sure [15:11:57] :-) great [15:13:16] arturo: we currently have 371 connections [15:13:46] could you please share that graph link again? [15:13:53] 10Operations, 10monitoring, 10Patch-For-Review: Evaluate Grafana's LDAP group options and deprecate grafana-admin if possible - https://phabricator.wikimedia.org/T170150 (10akosiaris) >>! 
In T170150#4492605, @Mvolz wrote: > I've tried to log-in with my LDAP credentials and couldn't. I've tried every username... [15:13:58] https://grafana.wikimedia.org/dashboard/db/mysql?panelId=37&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1073&var-port=9104&from=1534767234829&to=1534778034829 [15:15:23] 10Operations, 10DBA, 10monitoring: HAproxy on dbproxy hosts lack enough logging - https://phabricator.wikimedia.org/T201021 (10jcrespo) Should we add prometheus-haproxy-exporter in scope of this, too? [15:16:22] 10Operations, 10DBA, 10monitoring: HAproxy on dbproxy hosts lack enough logging - https://phabricator.wikimedia.org/T201021 (10jcrespo) No need, tracked on T191400 [15:16:41] jynus: you can kill stale conns by keystone [15:16:48] arturo: ok [15:17:22] done [15:18:20] 10Operations, 10TechCom-RFC, 10Traffic, 10Patch-For-Review, and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10Ottomata) > We also need internal requests to be traced, so I would assume we need all services to generate a request Id wheneve... [15:18:55] https://grafana.wikimedia.org/dashboard/db/mysql?panelId=37&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1073&var-port=9104&from=now-30m&to=now [15:18:57] thanks [15:19:01] Keystone should have a max pool of 20 connections now, as long as it doesn't portion that out by running thread, it shouldn't eat connections at this point. It has a lot of running threads, so... 
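On the "kill stale conns by keystone" step above: the log does not say how the cleanup was actually performed. As a hedged sketch only, assuming processlist rows are available as dicts keyed by the `information_schema.processlist` column names (`ID`, `USER`, `COMMAND`, `TIME`), idle connections for one account could be selected and turned into `KILL` statements for a human to review. The helper names, the idle threshold, and the sample rows are all hypothetical.

```python
def idle_connection_ids(rows, user, min_idle_seconds):
    """Pick connection IDs for `user` that have been sleeping long enough.

    rows: iterable of dicts with ID, USER, COMMAND and TIME keys.
    """
    return [
        row["ID"]
        for row in rows
        if row["USER"] == user
        and row["COMMAND"] == "Sleep"
        and row["TIME"] >= min_idle_seconds
    ]


def kill_statements(ids):
    """Render one KILL statement per connection ID (for review, not execution)."""
    return [f"KILL {conn_id};" for conn_id in ids]


# Illustrative rows only; not taken from db1073.
sample_rows = [
    {"ID": 101, "USER": "keystone", "COMMAND": "Sleep", "TIME": 900},
    {"ID": 102, "USER": "keystone", "COMMAND": "Query", "TIME": 2},
    {"ID": 103, "USER": "nova", "COMMAND": "Sleep", "TIME": 1200},
    {"ID": 104, "USER": "keystone", "COMMAND": "Sleep", "TIME": 30},
]

stale = idle_connection_ids(sample_rows, user="keystone", min_idle_seconds=600)
print(kill_statements(stale))
```

As noted earlier in the discussion, killing idle connections only buys time if the clients reconnect, so this is a triage aid rather than a fix.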
[15:19:11] I can check/count [15:19:22] thanks :) [15:19:24] jynus: please do :-P [15:19:37] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting Access to view EventLogging data for gabriel-wmde / gbirke - https://phabricator.wikimedia.org/T202072 (10Addshore) [15:19:45] It has 67 [15:19:48] root@db1073[(none)]> select count(*) FROM information_schema.processlist where user='keystone'; -> 67 [15:19:57] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting access to view EventLogging data for Tim WMDE - https://phabricator.wikimedia.org/T202063 (10Addshore) [15:20:01] are you using it from 3 or more places? [15:20:03] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting access to view EventLogging data for Tonina WMDE - https://phabricator.wikimedia.org/T202069 (10Addshore) [15:20:15] 70 now [15:20:22] Heh [15:20:30] -_- [15:20:34] I may need to set it lower [15:20:41] wait [15:21:04] maybe you are doing it wrong (e.g. several clients) or it needs restart or something? [15:21:14] I am asking before setting too low [15:21:14] It restarted. [15:21:29] But I can do it again in case [15:21:41] maybe showing the source host would be helpful? [15:21:56] Also, forks to some 97 procs [15:22:13] I only see 1 ip, 208.80.154.23 [15:22:14] I am concerned that it may be setting up multiple pools...though that seems nuts [15:22:30] yes, which would explain the crazy amount of connections [15:22:39] 10Operations, 10TechCom-RFC, 10Traffic, 10Patch-For-Review, and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10Pchelolo) > unique user agent (generally the name of the calling job or application). Might be worth exploring doing that as wel... 
[15:22:42] normally you create 10, 50 at most in a dedicated host for high performance [15:23:05] to give you an idea, our largest enwiki db host has a limit of 64 concurrent threads [15:23:07] 208.80.154.23 is the keystone host [15:23:14] and it does 20K+ QPS [15:23:35] Yeah, I think this thing is crazy with subprocs. Don't get that. I'll take a look at why it's forking so much [15:23:56] so, my suggestion is for you (as in your team) to take it easy [15:24:13] research if something is wrong rather than go too crazy with limits [15:24:28] right now I don't think we are in an emergency [15:24:54] but I don't think the config change had an impact on threads connected [15:24:56] see: [15:25:01] https://grafana.wikimedia.org/dashboard/db/mysql?panelId=37&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1073&var-port=9104&from=now-30m&to=now [15:25:15] ack [15:25:42] 10Operations, 10DBA, 10monitoring: HAproxy on dbproxy hosts lack enough logging - https://phabricator.wikimedia.org/T201021 (10Marostegui) My proposal to get this logging would be to enable: https://cbonte.github.io/haproxy-dconv/configuration-1.5.html#4.2-option%20log-health-checks ``` When this option is... [15:25:54] worst-case scenario, I can enforce limits as "bad cop" with a killer process, but I would like to avoid that [15:26:23] +1 to that, marostegui [15:26:57] jynus: I did a manual restart again [15:27:10] To see if that affects its behavior (can't always trust puppet hooks) [15:28:35] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting access to view EventLogging data for Tim WMDE - https://phabricator.wikimedia.org/T202063 (10Tim_WMDE) Hey @Dzahn, here you go: {F25177022} L3 is signed as well. [15:28:36] Can you check the number again?
[15:31:35] bstorm_: keystone has 108 now [15:31:49] (03PS1) 10Marostegui: db-master.cfg: Enable haproxy health-check logging [puppet] - 10https://gerrit.wikimedia.org/r/454039 (https://phabricator.wikimedia.org/T201021) [15:32:40] Yeah. This has 97 workers. Going to limit that as well. [15:32:57] 10Operations, 10TCB-Team, 10wikidiff2, 10WMDE-QWERTY-Sprint-2018-08-14: Update wikidiff2 library on the WMF production cluster to v1.7.2 - https://phabricator.wikimedia.org/T199801 (10WMDE-Fisch) [15:33:08] Apparently that's the issue. No idea why it is spawning so many, but the db connections were unlimited by default, so why not workers! :) [15:34:15] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler02/12137/" [puppet] - 10https://gerrit.wikimedia.org/r/454039 (https://phabricator.wikimedia.org/T201021) (owner: 10Marostegui) [15:34:44] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting access to view EventLogging data for Tonina WMDE - https://phabricator.wikimedia.org/T202069 (10Tonina_Zhelyazkova_WMDE) Hi @Dzahn I've signed `L3` and here's my SSH key ```ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC6GvnAGyZIE/zjBpRQINxGS8fzTTj1Tj... [15:35:56] bstorm_: arturo: based on the graph, we are back to the same levels before the change [15:36:12] Yes. I believe I now know why [15:36:34] Keystone is defaulting to creating a worker process per CPU core. [15:36:39] I'll mess with that value [15:36:41] so you can self-serve at: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=37&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1073&var-port=9104&from=now-1h&to=now [15:36:45] Thanks :) [15:37:47] there is nova, I think with similar issues [15:38:08] 146 connections [15:38:17] in case the same fix works for that too [15:38:44] Yeah...I think this is because of the mitaka upgrade. We aren't using the defaults for this because they are set on apache, which we apparently are not actually running like. 
[15:38:44] 10Operations, 10Scap, 10Patch-For-Review: Intermittent git-fat failure during deploy - https://phabricator.wikimedia.org/T202100 (10Ottomata) It might need to also be updated on targets hm. [15:40:44] (03CR) 10Jcrespo: [C: 031] "But let's deploy (reload) on a passive proxy first." [puppet] - 10https://gerrit.wikimedia.org/r/454039 (https://phabricator.wikimedia.org/T201021) (owner: 10Marostegui) [15:41:16] (03CR) 10Marostegui: "> But let's deploy (reload) on a passive proxy first." [puppet] - 10https://gerrit.wikimedia.org/r/454039 (https://phabricator.wikimedia.org/T201021) (owner: 10Marostegui) [15:42:24] 10Operations, 10TCB-Team, 10WMDE-QWERTY-Team, 10wikidiff2: Release wikidiff2 v1.7.3 and update the production serves - https://phabricator.wikimedia.org/T202301 (10WMDE-Fisch) [15:42:43] 10Operations, 10TCB-Team, 10WMDE-QWERTY-Team, 10wikidiff2: Release wikidiff2 v1.7.3 and update the production serves - https://phabricator.wikimedia.org/T202301 (10WMDE-Fisch) @Legoktm it would be, again, super awesome if you could take care of the first two parts of this. ( you can use the current master... [15:52:26] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting Access to view EventLogging data for gabriel-wmde / gbirke - https://phabricator.wikimedia.org/T202072 (10Nuria) You will need to sign an NDA regardless @gabriel-wmde you please start doing that. 
[15:53:11] (03PS1) 10Bstorm: keystone: Limiting worker process numbers [puppet] - 10https://gerrit.wikimedia.org/r/454042 (https://phabricator.wikimedia.org/T188589) [15:54:41] (03CR) 10Arturo Borrero Gonzalez: [C: 031] keystone: Limiting worker process numbers [puppet] - 10https://gerrit.wikimedia.org/r/454042 (https://phabricator.wikimedia.org/T188589) (owner: 10Bstorm) [15:57:16] (03PS2) 10Bstorm: keystone: Limiting worker process numbers [puppet] - 10https://gerrit.wikimedia.org/r/454042 (https://phabricator.wikimedia.org/T188589) [15:57:49] (03CR) 10Lucas Werkmeister (WMDE): [C: 031] "Reviewed the rest. Sandbox properties look good; the deprecated properties are a bit outdated now, but that’s probably our fault for takin" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449017 (owner: 10Matěj Suchánek) [15:58:10] (03PS3) 10Bstorm: keystone: Limiting worker process numbers [puppet] - 10https://gerrit.wikimedia.org/r/454042 (https://phabricator.wikimedia.org/T188589) [15:59:18] (03CR) 10Bstorm: [C: 032] keystone: Limiting worker process numbers [puppet] - 10https://gerrit.wikimedia.org/r/454042 (https://phabricator.wikimedia.org/T188589) (owner: 10Bstorm) [16:02:34] (03PS1) 10Vgutierrez: [WIP] Certcentral integration tests [software/certcentral] - 10https://gerrit.wikimedia.org/r/454045 [16:03:43] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Certcentral integration tests [software/certcentral] - 10https://gerrit.wikimedia.org/r/454045 (owner: 10Vgutierrez) [16:04:42] (03PS1) 10Bstorm: keystone: fix error in the service file [puppet] - 10https://gerrit.wikimedia.org/r/454046 [16:05:52] (03CR) 10Bstorm: [C: 032] keystone: fix error in the service file [puppet] - 10https://gerrit.wikimedia.org/r/454046 (owner: 10Bstorm) [16:10:17] PROBLEM - puppet last run on labtestcontrol2003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:11:40] That limited connections [16:11:47] I can tighten it more from here [16:12:06] bstorm_: yeah, keystone has now 27 [16:12:14] Awesome! [16:12:17] Now for nova [16:12:27] Nova currently has 148 [16:12:52] heh [16:12:53] Yeah [16:15:17] RECOVERY - puppet last run on labtestcontrol2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:22:48] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:24:49] 10Operations, 10ops-codfw: Check/replace PEM2 on cr2-codfw - https://phabricator.wikimedia.org/T202166 (10Papaul) Dear Juniper Networks Customer, Thank you for contacting Juniper Networks Global Support. SR 2018-0820-0440 with Priority P2 has been CREATED by you to track issue as described below. [16:27:08] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:37:36] (03CR) 10Jcrespo: [C: 031] "Support, we had no issues anywhere, although we will need to cleanup afterwards all selective installation as a followup (?)" [puppet] - 10https://gerrit.wikimedia.org/r/453997 (https://phabricator.wikimedia.org/T127825) (owner: 10Muehlenhoff) [16:37:37] 10Operations, 10ops-codfw: Check/replace PEM2 on cr2-codfw - https://phabricator.wikimedia.org/T202166 (10Papaul) @ayounsi Please see below Hi Papaul, Thank you for contacting Juniper Networks, I am Javier Gutierrez from JTAC and I have taken the ownership of this case. From the case notes I could u... [16:40:09] (03CR) 10Muehlenhoff: "Yeah, I'll make a cleanup patch to drop the old Hiera mechanism as a followup." 
[puppet] - 10https://gerrit.wikimedia.org/r/453997 (https://phabricator.wikimedia.org/T127825) (owner: 10Muehlenhoff) [16:40:11] (03PS1) 10Bstorm: nova: Limit number of worker processes [puppet] - 10https://gerrit.wikimedia.org/r/454055 (https://phabricator.wikimedia.org/T188589) [16:40:53] (03CR) 10jerkins-bot: [V: 04-1] nova: Limit number of worker processes [puppet] - 10https://gerrit.wikimedia.org/r/454055 (https://phabricator.wikimedia.org/T188589) (owner: 10Bstorm) [16:50:44] 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi, 10User-herron: Shorten logstash retention temporarily - https://phabricator.wikimedia.org/T201971 (10Krinkle) With T201974 solved, overall influx of Logstash databases has dropped over 50%: | {F25179667} | {F25179668} [16:53:43] 10Operations: Onboarding Cole White - https://phabricator.wikimedia.org/T202136 (10Dzahn) root access has been approved in today's SRE meeting [16:54:00] (03PS2) 10Arturo Borrero Gonzalez: nova: Limit number of worker processes [puppet] - 10https://gerrit.wikimedia.org/r/454055 (https://phabricator.wikimedia.org/T188589) (owner: 10Bstorm) [16:55:50] (03CR) 10Arturo Borrero Gonzalez: [C: 031] nova: Limit number of worker processes [puppet] - 10https://gerrit.wikimedia.org/r/454055 (https://phabricator.wikimedia.org/T188589) (owner: 10Bstorm) [16:56:37] (03CR) 10Alexandros Kosiaris: [C: 032] etherpad: Add article to the placeholder text [puppet] - 10https://gerrit.wikimedia.org/r/452716 (owner: 10Ladsgroup) [16:56:45] (03PS2) 10Alexandros Kosiaris: etherpad: Add article to the placeholder text [puppet] - 10https://gerrit.wikimedia.org/r/452716 (owner: 10Ladsgroup) [16:58:37] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Watching / External): Add contint-roots to releases{1,2}001 - https://phabricator.wikimedia.org/T201470 (10Dzahn) We have had some minor discussion on IRC about it. I mentioned that the point was brought up that even with full root access the u... 
[16:58:49] (03PS3) 10Smalyshev: Switch entity reference type indexing from opt-in to opt-out [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452956 (https://phabricator.wikimedia.org/T199884) [16:58:57] !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@4989231]: Create metrics to track the actual concurrent job executions T202107 [16:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:05] T202107: Job queue should not overload the DB servers when there is replication lag - https://phabricator.wikimedia.org/T202107 [16:59:53] !log ppchelko@deploy1001 Finished deploy [cpjobqueue/deploy@4989231]: Create metrics to track the actual concurrent job executions T202107 (duration: 00m 55s) [16:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:04] gehel: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180820T1700). 
[17:00:17] jouncebot: o/ [17:00:30] 10Operations, 10Analytics, 10Analytics-Kanban: Move internal sites hosted on thorium to ganeti instance(s) - https://phabricator.wikimedia.org/T202011 (10Ottomata) [17:09:00] jouncebot: slight delay with wdqs deployment, there is an ongoing issue with git-fat [17:09:23] (03PS3) 10Arturo Borrero Gonzalez: nova: Limit number of worker processes [puppet] - 10https://gerrit.wikimedia.org/r/454055 (https://phabricator.wikimedia.org/T188589) (owner: 10Bstorm) [17:09:36] jouncebot: next [17:09:36] In 0 hour(s) and 50 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180820T1800) [17:10:29] (03PS4) 10Arturo Borrero Gonzalez: nova: Limit number of worker processes [puppet] - 10https://gerrit.wikimedia.org/r/454055 (https://phabricator.wikimedia.org/T188589) (owner: 10Bstorm) [17:12:24] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Compiler is happy https://puppet-compiler.wmflabs.org/compiler02/12138/" [puppet] - 10https://gerrit.wikimedia.org/r/454055 (https://phabricator.wikimedia.org/T188589) (owner: 10Bstorm) [17:15:22] (03PS1) 10Ottomata: Release 0.1.3-2 with merge conflicts fixed [debs/git-fat] (debian) - 10https://gerrit.wikimedia.org/r/454063 [17:15:45] (03CR) 10Ottomata: [V: 032 C: 032] Release 0.1.3-2 with merge conflicts fixed [debs/git-fat] (debian) - 10https://gerrit.wikimedia.org/r/454063 (owner: 10Ottomata) [17:16:04] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert. [17:17:04] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting. 
[17:17:51] (03PS3) 10Zhuyifei1999: [WIP] Quarry: Move the install into a venv and upgrade to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/451698 (https://phabricator.wikimedia.org/T192698) [17:27:16] (03CR) 10Gehel: Add ability to load daily category dumps. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/452569 (https://phabricator.wikimedia.org/T201217) (owner: 10Smalyshev) [17:27:23] (03PS1) 10Smalyshev: Enable daily category diffs on test [puppet] - 10https://gerrit.wikimedia.org/r/454067 (https://phabricator.wikimedia.org/T201217) [17:27:26] (03PS5) 10Gehel: Add ability to load daily category dumps. [puppet] - 10https://gerrit.wikimedia.org/r/452569 (https://phabricator.wikimedia.org/T201217) (owner: 10Smalyshev) [17:44:40] !log install iotop on stat1004 [17:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:04] chasemp: ^^ cool ? why just curious whatcha doin?! :) [17:52:47] ottomata: trying to figure things out really, not sure how much luca told you about what we were working on? [17:52:59] nope! [17:53:01] don't think so [17:53:34] ottomata: I'll hit you up in PM :) [17:54:46] (03PS6) 10Smalyshev: Add ability to load daily category dumps. [puppet] - 10https://gerrit.wikimedia.org/r/452569 (https://phabricator.wikimedia.org/T201217) [17:55:45] (03PS7) 10Smalyshev: Add ability to load daily category dumps. [puppet] - 10https://gerrit.wikimedia.org/r/452569 (https://phabricator.wikimedia.org/T201217) [18:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy Morning SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180820T1800). [18:00:04] brion, Hauskatze, and framawiki: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:10] \o/ [18:00:12] o/ [18:00:40] I can *not* SWAT my own changes; I'd appreciate somebody else doing it for me, thanks. [18:05:31] !log restarting icinga, re-occurrence of T196336 [18:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:38] T196336: Icinga passive checks go awal and downtime stops working - https://phabricator.wikimedia.org/T196336 [18:08:39] (03PS2) 10Ottomata: Use newer librdkafka for eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/452958 (https://phabricator.wikimedia.org/T200769) [18:08:45] (03CR) 10Ottomata: [V: 032 C: 032] Use newer librdkafka for eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/452958 (https://phabricator.wikimedia.org/T200769) (owner: 10Ottomata) [18:09:33] it looks like no one is around to SWAT brion? [18:09:41] :( [18:10:02] if need be i can wait for next one :D [18:10:25] I'd prefer not to, it's a hotfix [18:10:30] mine I mean [18:11:39] (checking in other channels if anyone's able to deploy) [18:11:55] great :) thanks [18:12:08] * Hauskatze eyes James_F discreetly [18:12:41] (03PS1) 10Ottomata: Fix librdkafka backports pin for eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/454074 (https://phabricator.wikimedia.org/T200769) [18:13:03] (03CR) 10Ottomata: [V: 032 C: 032] Fix librdkafka backports pin for eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/454074 (https://phabricator.wikimedia.org/T200769) (owner: 10Ottomata) [18:14:34] I can do it. [18:14:37] \o/ [18:14:38] :D [18:14:44] <3 [18:14:45] Sorry, wasn't paying attention. [18:14:57] :D [18:16:24] brion: Hmm. Wouldn't it be better to kill the registration function? [18:16:55] (Eh. Let's merge now and improve later.) [18:16:56] James_F: the registration function has logic in it [18:17:00] let's clean it up laters :D [18:17:14] but yeah over time we should mostly kill it [18:17:44] \o/ [18:20:31] framawiki: You here too?
[18:20:35] (03CR) 10Jforrester: [C: 032] Throttle exemption for 24 August [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452954 (https://phabricator.wikimedia.org/T202003) (owner: 10Framawiki) [18:20:37] (03CR) 10Jforrester: [C: 032] Set $wmgUseFooterContactLink on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452955 (https://phabricator.wikimedia.org/T201783) (owner: 10Framawiki) [18:20:38] o/ [18:20:39] (03CR) 10Jforrester: [C: 032] Set $wmgUseFooterContactLink on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453667 (https://phabricator.wikimedia.org/T202014) (owner: 10Framawiki) [18:20:45] Cool. [18:20:45] (03CR) 10jerkins-bot: [V: 04-1] Throttle exemption for 24 August [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452954 (https://phabricator.wikimedia.org/T202003) (owner: 10Framawiki) [18:21:45] James_F: 'll rebase my patchs [18:22:15] Thanks! [18:22:24] Saves me from having to do it. :-) [18:22:56] don't some repos have a bot that does rebase the patches on request? [18:23:00] (03PS1) 10Bstorm: nova: limit metadata workers [puppet] - 10https://gerrit.wikimedia.org/r/454075 (https://phabricator.wikimedia.org/T188589) [18:23:12] I don't like to manually rebase, it's kinda confusing for me [18:23:34] (03PS2) 10Framawiki: Throttle exemption for 24 August [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452954 (https://phabricator.wikimedia.org/T202003) [18:23:51] Hauskatze: In this case, there's a real clash. [18:24:20] (03CR) 10Jforrester: [C: 032] Throttle exemption for 24 August [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452954 (https://phabricator.wikimedia.org/T202003) (owner: 10Framawiki) [18:24:33] James_F: my patch is failing on some NPM error [18:24:39] never ran into that before [18:24:40] * James_F looks. 
[18:24:40] !log gehel@deploy1001 Started deploy [wdqs/wdqs@df2da41]: new version of wdqs GUI and updater (wdqs1009 only) [18:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:13] !log gehel@deploy1001 Finished deploy [wdqs/wdqs@df2da41]: new version of wdqs GUI and updater (wdqs1009 only) (duration: 00m 32s) [18:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:20] Oh, yes, that. [18:25:41] shasum check failed for /tmp/npm-716-d0932d44/registry.npmjs.org/mwbot/-/mwbot-1.0.10.tgz [18:25:46] idk what's that [18:25:51] looks unrelated to my job [18:25:56] It's a failure in the npm transport. [18:26:08] Happens fairly regularly. :-( [18:26:25] not a blocker? [18:26:33] !log gehel@deploy1001 Started deploy [wdqs/wdqs@df2da41]: new version of wdqs GUI and updater [18:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:40] Nope, should just randomly work next time. [18:26:50] good then [18:27:12] Not really. CI should either pass or fail reliably. [18:27:33] I mean, it's good it's not a blocker [18:27:52] (03Merged) 10jenkins-bot: Throttle exemption for 24 August [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452954 (https://phabricator.wikimedia.org/T202003) (owner: 10Framawiki) [18:28:17] another good-to-fix would-be all those "npm WARN deprecated" stuff that floods the logs [18:28:33] but those are usually dependencies of dependencies^n [18:28:51] Yeah, I'm working on that. [18:29:31] (03CR) 10Bstorm: "Confirmed this results in a noop on labcontrol1001, and making changes for that are not necessary. 
This will only affect neutron versions" [puppet] - 10https://gerrit.wikimedia.org/r/454075 (https://phabricator.wikimedia.org/T188589) (owner: 10Bstorm) [18:29:37] (03CR) 10Bstorm: "https://puppet-compiler.wmflabs.org/compiler02/12141/" [puppet] - 10https://gerrit.wikimedia.org/r/454075 (https://phabricator.wikimedia.org/T188589) (owner: 10Bstorm) [18:29:38] npm6 => use `npm ci` => use package-lock.json => Manually bump sub-sub-sub-dependencies with `npm audit fix` => Fewer warnings. [18:29:51] (03CR) 10Bstorm: [C: 032] nova: limit metadata workers [puppet] - 10https://gerrit.wikimedia.org/r/454075 (https://phabricator.wikimedia.org/T188589) (owner: 10Bstorm) [18:30:37] (03PS1) 10Jijiki: icinga: Added user jijiki in contacts groups. [puppet] - 10https://gerrit.wikimedia.org/r/454076 (https://phabricator.wikimedia.org/T201816) [18:31:08] !log restarted ircecho on einsteinium [18:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:22] (03CR) 10Dzahn: [C: 031] "we just created the user in private repo, so this can be merged anytime" [puppet] - 10https://gerrit.wikimedia.org/r/454076 (https://phabricator.wikimedia.org/T201816) (owner: 10Jijiki) [18:31:32] It looks it's failing again -or- zuul has not picked yet the new +2 [18:32:12] looks like the old one, per the timestamps [18:32:30] Yeah, it's the old one. [18:32:40] It's also running both PS2 and PS3 of brion's TMH patch. [18:32:45] * James_F sighs at dumbware. 
[18:32:50] !log jforrester@deploy1001 Synchronized wmf-config/throttle.php: SWAT T202003 Add throttle exemption for 24 August (duration: 00m 51s) [18:32:53] heh [18:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:57] T202003: Throttle exemption for event in Ireland - https://phabricator.wikimedia.org/T202003 [18:33:10] framawiki: Well, at least that's one of them [18:33:33] James_F: thanks :) [18:33:54] (03PS2) 10Jforrester: Set $wmgUseFooterContactLink on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452955 (https://phabricator.wikimedia.org/T201783) (owner: 10Framawiki) [18:34:07] (03CR) 10Jforrester: [C: 032] Set $wmgUseFooterContactLink on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452955 (https://phabricator.wikimedia.org/T201783) (owner: 10Framawiki) [18:34:09] (03CR) 10Gehel: [C: 031] Add ability to load daily category dumps. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/452569 (https://phabricator.wikimedia.org/T201217) (owner: 10Smalyshev) [18:34:22] (03PS2) 10Jforrester: Set $wmgUseFooterContactLink on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453667 (https://phabricator.wikimedia.org/T202014) (owner: 10Framawiki) [18:34:32] (03CR) 10Jforrester: [C: 032] Set $wmgUseFooterContactLink on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453667 (https://phabricator.wikimedia.org/T202014) (owner: 10Framawiki) [18:34:45] (03CR) 10Jijiki: [C: 031] icinga: Added user jijiki in contacts groups. 
[puppet] - 10https://gerrit.wikimedia.org/r/454076 (https://phabricator.wikimedia.org/T201816) (owner: 10Jijiki) [18:35:29] (03Merged) 10jenkins-bot: Set $wmgUseFooterContactLink on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452955 (https://phabricator.wikimedia.org/T201783) (owner: 10Framawiki) [18:35:56] (03Merged) 10jenkins-bot: Set $wmgUseFooterContactLink on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453667 (https://phabricator.wikimedia.org/T202014) (owner: 10Framawiki) [18:36:05] some music in the meanwhile :) [18:36:54] (03PS4) 10Zhuyifei1999: [WIP] Quarry: Move the install into a venv and upgrade to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/451698 (https://phabricator.wikimedia.org/T192698) [18:36:56] (03CR) 10jenkins-bot: Throttle exemption for 24 August [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452954 (https://phabricator.wikimedia.org/T202003) (owner: 10Framawiki) [18:36:58] !log gehel@deploy1001 Finished deploy [wdqs/wdqs@df2da41]: new version of wdqs GUI and updater (duration: 10m 25s) [18:36:58] (03CR) 10jenkins-bot: Set $wmgUseFooterContactLink on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452955 (https://phabricator.wikimedia.org/T201783) (owner: 10Framawiki) [18:37:00] (03CR) 10jenkins-bot: Set $wmgUseFooterContactLink on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453667 (https://phabricator.wikimedia.org/T202014) (owner: 10Framawiki) [18:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:15] Okie dokie. [18:37:38] (03PS2) 10Jijiki: icinga: Added user jijiki in contacts groups. 
[puppet] - 10https://gerrit.wikimedia.org/r/454076 (https://phabricator.wikimedia.org/T201816) [18:37:40] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Quarry: Move the install into a venv and upgrade to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/451698 (https://phabricator.wikimedia.org/T192698) (owner: 10Zhuyifei1999) [18:38:18] framawiki: Both the ruwiki and frwiki footer changes are live on mwdebug1002. Can you test and confirm? [18:39:12] (03PS8) 10Gehel: Add ability to load daily category dumps. [puppet] - 10https://gerrit.wikimedia.org/r/452569 (https://phabricator.wikimedia.org/T201217) (owner: 10Smalyshev) [18:39:59] (03CR) 10Gehel: [C: 032] Add ability to load daily category dumps. [puppet] - 10https://gerrit.wikimedia.org/r/452569 (https://phabricator.wikimedia.org/T201217) (owner: 10Smalyshev) [18:40:03] 10Operations, 10Cloud-Services, 10DBA, 10Patch-For-Review: m5-master overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589 (10Bstorm) I've more than halved the number of nova workers. I didn't see a big drop in the usage on grafana this time. One thing I haven't... [18:41:03] James_F: not good [18:41:23] do you see a "contact" link in the footer of https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Accueil_principal ? [18:41:54] (03CR) 10Jijiki: [C: 032] icinga: Added user jijiki in contacts groups. [puppet] - 10https://gerrit.wikimedia.org/r/454076 (https://phabricator.wikimedia.org/T201816) (owner: 10Jijiki) [18:42:02] framawiki: I do, on mwdebug1002. [18:42:09] framawiki: Not in full production though. [18:42:17] (Until I sync it.) [18:42:23] (03PS5) 10Zhuyifei1999: [WIP] Quarry: Move the install into a venv and upgrade to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/451698 (https://phabricator.wikimedia.org/T192698) [18:43:17] I'm proceeding. 
[18:43:57] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Watching / External): Add contint-roots to releases{1,2}001 - https://phabricator.wikimedia.org/T201470 (10thcipriani) >>! In T201470#4515591, @Dzahn wrote: > Would that work if we continue with more specific sudo privilege lines for using apt?... [18:44:10] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT T201783 T202014 Add Contact link to footers of fr,ruwiki (duration: 00m 50s) [18:44:13] James_F: ok, was a local cache problem. good for me too! [18:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:17] Cool. :-) [18:44:18] T202014: Set $wmgUseFooterContactLink = true on frwiki - https://phabricator.wikimedia.org/T202014 [18:44:19] T201783: Set `$wmgUseFooterContactLink = true` on Russian Wikipedia - https://phabricator.wikimedia.org/T201783 [18:44:24] framawiki: That's you all done. [18:44:30] (03PS3) 10Jijiki: icinga: Added user jijiki in contacts groups. [puppet] - 10https://gerrit.wikimedia.org/r/454076 (https://phabricator.wikimedia.org/T201816) [18:44:34] Sorry to brion and Hauskatze for the slowness. :-( [18:44:44] Not your fault James_F [18:44:51] thanks for SWAT-ting [18:44:59] Well, technically brion's is fixing up my breakage, so… ;-) [18:45:09] But you're welcome. [18:45:39] 10Operations, 10Cloud-Services, 10DBA, 10Patch-For-Review: m5-master overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589 (10Bstorm) The biggest issue overall with the current level was that cloudcontrol1003 has so many cpu cores and worker values in openstack mit... [18:46:48] :D [18:48:40] The TMH one has now been in the queue for 30 minutes(!) but "only" actually processing for the last 7. [18:48:42] Oy. [18:48:59] James_F: thanks ! [18:49:13] At least WikimediaMessages passed now. [18:49:22] Hauskatze: Let's not jinx things. 
:-) [18:49:34] 10Operations, 10Scap, 10Patch-For-Review: Intermittent git-fat failure during deploy - https://phabricator.wikimedia.org/T202100 (10thcipriani) >>! In T202100#4515232, @Ottomata wrote: > It might need to also be updated on targets hm. Yeah, we'll need to update the targets since that's where we run into the... [18:49:35] i keep wishing phpunit had a parallelization option. one of these days we're just gonna have to hack it on [18:49:43] This system of gluing together the changes... I'd prefer if they could be tested individually [18:49:48] it looks slower to me [18:50:08] brion: c2+v2 manually :P [18:50:26] haha [18:50:33] the fastest tests are those that don't run ;) [18:50:44] let da users experience broken stuff [18:50:48] brion: Given that the tests can break each other, running sequentially is probably better… [18:50:55] it builds character [18:51:16] they.... should be isolated.... in theory :D [18:51:20] I think we should go back to the Good Old Days™ of taking the wiki read-only for a few hours each time we pushed a new version of MW. [18:51:31] oh god no [18:51:33] Like the MW1.17 RL branch merge, for instance. [18:51:35] lol [18:51:39] we'd just be permanent readonly [18:51:53] which admittedly would solve many of our performance and abuse issues [18:52:00] ^^^ [18:52:08] merged \o/ [18:52:10] Or when Tim declared that MW 1.3 was coming out with a breaking change to template syntax, so he was going to put them read-only, re-write history to use the new syntax, and then bring them up. [18:52:21] >:) [18:52:22] * James_F grins. [18:52:39] Finally. [18:53:57] let me know when the change is on mwdebug and I'll check as well :) [18:54:00] Sure. [18:54:03] 10Operations, 10Scap, 10Patch-For-Review: Intermittent git-fat failure during deploy - https://phabricator.wikimedia.org/T202100 (10Ottomata) we just updated it for wdqs* hosts. If that worked fine for @Gehel and Erik, we'll update the rest of the fleet (all nodes!)
with @MoritzMuehlenhoff when he's back aro... [18:54:21] brion, Hauskatze: Live on mwdebug1002 now. [18:54:34] looks good! "Permitted file types: tiff, tif, png, gif, jpg, jpeg, webp, xcf, mid, ogg, ogv, svg, djvu, stl, oga, flac, opus, wav, webm, mp3." [18:54:41] and noooooo mp4 [18:54:44] Yup. [18:54:45] 10Operations, 10Quarry, 10cloud-services-team (Kanban): let quarry use the mariadb module - https://phabricator.wikimedia.org/T181205 (10zhuyifei1999) @Dzahn @jcrespo Any hint on what is the current equivalent of https://github.com/wikimedia/puppet/blob/production/modules/quarry/manifests/database.pp: ``` cl... [18:54:49] thanks James_F ! [18:55:01] checking [18:55:57] not working https://meta.wikimedia.org/wiki/Special:GlobalUsers?username=&group=otrs-member&limit=50 [18:56:17] !log jforrester@deploy1001 Synchronized php-1.32.0-wmf.16/extensions/TimedMediaHandler/: SWAT T202208Fix regression: double-reg of file types and incorrect mp4 (duration: 00m 52s) [18:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:55] Hauskatze: It'll need an i18n rebuild from scap, I guess. [18:56:57] * James_F sighs. [18:57:06] * Hauskatze headesks [18:57:47] Hauskatze: I can `scap sync` I guess, but… I've never done that. [18:58:12] I'd rather not, if that's OK. [18:58:35] fine for me, but I'm not going to try to submit that again [18:58:45] I'll leave it to others [18:59:13] special:version on mwdebug doesn't list this commit either [18:59:27] It doesn't get updated except for branches. [18:59:37] 'Cos we don't copy the .git dirs around the cluster. [18:59:49] although it is there https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/WikimediaMessages/+/wmf/1.32.0-wmf.16 [18:59:51] I'm slinging it out everywhere. 
[19:00:04] !log jforrester@deploy1001 Synchronized php-1.32.0-wmf.16/extensions/WikimediaMessages/: SWAT T202095 Reinstate "Rename global OTRS-member group to otrs-member" (will need scap sync later) (duration: 00m 49s) [19:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:10] T202095: Require that CentralAuth's global groups all use lowercase internal identifiers - https://phabricator.wikimedia.org/T202095 [19:00:28] fwiw, isn't it run automatically from time to time? [19:00:39] It does. But not for ~12 hours' time. Is that OK? [19:01:01] if you can't/won't run it James_F; if you know somebody that does know about running it? [19:01:18] I'd prefer to leave this settled, but if not I guess waiting is okay [19:01:23] 10Operations, 10Cloud-Services, 10DBA, 10Patch-For-Review: m5-master overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589 (10Bstorm) The fact that the idle timeout for api database connections is set at an hour by default might be why it didn't drop right away... [19:02:40] No-one's deploying for the hour. I'll do it. [19:02:57] okay, thanks [19:03:03] if it is ok, I shall leave [19:03:13] or do you need me here for something? [19:03:28] anybody knows which mediawiki deployment group the dump hosts are in? [19:03:32] No, it should be fine. [19:03:35] 10Operations, 10Quarry, 10cloud-services-team (Kanban): Let quarry use the mariadb module - https://phabricator.wikimedia.org/T181205 (10Framawiki) [19:03:51] ok, thank you very much for your help James_F :) <3 [19:04:00] !log jforrester@deploy1001 Started scap: i18n sync for WikimediaMessages clean-up following T202095 [19:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:46] (03CR) 10Jforrester: "This was listed for SWAT but not merged. Any reason why?" 
[mediawiki-config] - https://gerrit.wikimedia.org/r/453696 (https://phabricator.wikimedia.org/T202139) (owner: Urbanecm)
[19:05:29] ok i gotta run. thanks y'all!
[19:05:35] Thanks, brion. :-)
[19:07:27] (PS1) Volans: cumin: add alias consistency checker [puppet] - https://gerrit.wikimedia.org/r/454077
[19:07:36] (CR) jerkins-bot: [V: -1] cumin: add alias consistency checker [puppet] - https://gerrit.wikimedia.org/r/454077 (owner: Volans)
[19:08:51] (PS2) Volans: cumin: add alias consistency checker [puppet] - https://gerrit.wikimedia.org/r/454077
[19:16:18] Operations, Patch-For-Review: Onboarding Effie Mouzeli - https://phabricator.wikimedia.org/T201816 (jijiki)
[19:16:41] Operations, Patch-For-Review: Onboarding Effie Mouzeli - https://phabricator.wikimedia.org/T201816 (jijiki) Also paging/notifications is done
[19:17:49] (CR) Volans: "Compiler results available here:" [puppet] - https://gerrit.wikimedia.org/r/454077 (owner: Volans)
[19:24:46] (PS2) Volans: Add a retry decorator [software/spicerack] - https://gerrit.wikimedia.org/r/453994 (https://phabricator.wikimedia.org/T199079)
[19:25:00] (CR) Volans: "Replies inline" (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/453994 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[19:28:33] PROBLEM - HHVM rendering on mw2224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:29:24] RECOVERY - HHVM rendering on mw2224 is OK: HTTP OK: HTTP/1.1 200 OK - 82193 bytes in 0.411 second response time
[19:32:46] Operations, Cloud-Services, DBA, Patch-For-Review: m5-master overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589 (Bstorm) Actually, nova-api db connections are down to 11 :) Looks like the only remaining problem is nova db itself (nova-conductor). Tha...
[19:35:12] !log jforrester@deploy1001 Finished scap: i18n sync for WikimediaMessages clean-up following T202095 (duration: 31m 12s)
[19:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:20] T202095: Require that CentralAuth's global groups all use lowercase internal identifiers - https://phabricator.wikimedia.org/T202095
[19:35:55] (PS1) Bstorm: nova: shorten idle timeout for sqlalchemy to reap db connections [puppet] - https://gerrit.wikimedia.org/r/454082 (https://phabricator.wikimedia.org/T188589)
[19:39:17] (CR) Bstorm: "https://puppet-compiler.wmflabs.org/compiler02/12144/" [puppet] - https://gerrit.wikimedia.org/r/454082 (https://phabricator.wikimedia.org/T188589) (owner: Bstorm)
[19:39:27] (CR) Bstorm: [C: 2] nova: shorten idle timeout for sqlalchemy to reap db connections [puppet] - https://gerrit.wikimedia.org/r/454082 (https://phabricator.wikimedia.org/T188589) (owner: Bstorm)
[19:45:01] Operations, Analytics, vm-requests: eqiad: (3) VM %request for internal analytics web sites - https://phabricator.wikimedia.org/T202013 (Dzahn) Ok, analytics-tool100[123] it is then. I'll get them IP addresses / add to DNS.
[19:45:46] (PS1) Rxy: Allow add or remove interface-admin group by wikidata-staff [mediawiki-config] - https://gerrit.wikimedia.org/r/454083 (https://phabricator.wikimedia.org/T202065)
[19:48:40] Operations, Analytics, vm-requests: eqiad: (3) VM %request for internal analytics web sites - https://phabricator.wikimedia.org/T202013 (Dzahn) added to the naming conventions page: https://wikitech.wikimedia.org/w/index.php?title=Infrastructure_naming_conventions&type=revision&diff=1800619&oldid=17...
[19:55:02] Operations, Cloud-Services, DBA, Patch-For-Review: m5-master overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589 (Marostegui) Thanks a lot Brooke for getting this fixed.
I will go back to 500 as max_connections tomorrow morning as it looks fine now.
[19:55:06] Operations, Analytics, vm-requests: eqiad: (3) VM %request for internal analytics web sites - https://phabricator.wikimedia.org/T202013 (Ottomata) Cool les do it
[19:55:41] (PS1) Dzahn: introduce analytics-tool100[123], assign v4 IPs [dns] - https://gerrit.wikimedia.org/r/454084 (https://phabricator.wikimedia.org/T202013)
[19:56:45] (CR) MarcoAurelio: [C: 1] Allow add or remove interface-admin group by wikidata-staff [mediawiki-config] - https://gerrit.wikimedia.org/r/454083 (https://phabricator.wikimedia.org/T202065) (owner: Rxy)
[19:56:57] (CR) Dzahn: [C: -2] introduce analytics-tool100[123], assign v4 IPs [dns] - https://gerrit.wikimedia.org/r/454084 (https://phabricator.wikimedia.org/T202013) (owner: Dzahn)
[19:59:14] (PS2) Dzahn: introduce analytics-tool100[123], assign v4 IPs [dns] - https://gerrit.wikimedia.org/r/454084 (https://phabricator.wikimedia.org/T202013)
[19:59:24] Operations, Fundraising-Backlog, Traffic, fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (Jgreen) I reran the SSLLabs analyzer on links.e.uso.org today and it's still scored a B, looks like for several issue (still including weak DH).
[20:00:05] cscott, arlolra, subbu, bearND, halfak, and Amir1: I, the Bot under the Fountain, allow thee, The Deployer, to do Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180820T2000).
[20:00:27] Nothing happens with ORES
[20:01:16] (PS3) Dzahn: introduce analytics-tool100[123], assign v4 IPs [dns] - https://gerrit.wikimedia.org/r/454084 (https://phabricator.wikimedia.org/T202013)
[20:01:41] Operations, ops-codfw, fundraising-tech-ops, Patch-For-Review: Rack/Setup frbast2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T196417 (Jgreen) a:Jgreen
[20:02:09] Operations, ops-eqiad, Performance-Team: tungsten disk 1 and 8 SMART failure - https://phabricator.wikimedia.org/T193628 (Imarlier) @Cmjohnson Yes, we're planning to get our stuff off of here in the near future, at which point this machine can be decommissioned. No need to actually replace these dis...
[20:02:23] Operations, ops-eqiad, Performance-Team (Radar): tungsten disk 1 and 8 SMART failure - https://phabricator.wikimedia.org/T193628 (Imarlier)
[20:06:30] (PS1) Dzahn: installserver: add analytics-tool to partman for ganeti VMs [puppet] - https://gerrit.wikimedia.org/r/454093 (https://phabricator.wikimedia.org/T202013)
[20:06:51] (PS2) Dzahn: installserver: add analytics-tool to partman for ganeti VMs [puppet] - https://gerrit.wikimedia.org/r/454093 (https://phabricator.wikimedia.org/T202013)
[20:08:23] (CR) Dzahn: [C: 2] installserver: add analytics-tool to partman for ganeti VMs [puppet] - https://gerrit.wikimedia.org/r/454093 (https://phabricator.wikimedia.org/T202013) (owner: Dzahn)
[20:09:53] (PS2) Dzahn: restbase: use ::profile::base::firewall [puppet] - https://gerrit.wikimedia.org/r/451819
[20:14:57] !log arlolra@deploy1001 Started deploy [parsoid/deploy@44aa5e8]: Updating Parsoid to 129d71f
[20:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:15:09] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/12145/restbase1007.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/451819 (owner: Dzahn)
[20:16:36] Operations, ops-eqiad, DC-Ops,
Fundraising-Backlog, fundraising-tech-ops: Decom tellurium - https://phabricator.wikimedia.org/T194408 (cwdent)
[20:17:11] Operations, ops-eqiad, DC-Ops, Fundraising-Backlog, fundraising-tech-ops: Decom tellurium - https://phabricator.wikimedia.org/T194408 (cwdent) a:cwdent>Cmjohnson
[20:26:26] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@44aa5e8]: Updating Parsoid to 129d71f (duration: 11m 29s)
[20:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:26:46] Operations, Quarry, cloud-services-team (Kanban): Let quarry use the mariadb module - https://phabricator.wikimedia.org/T181205 (Dzahn) @zhuyifei1999 I think that would be a combination of: class {'mariadb::packages_wmf': class {'mariadb::service': and class { 'mariadb::config': as used inside ro...
[20:28:34] (PS1) Aaron Schulz: Set "cluster" parameter for mcrouter broadcast routing prefixes [mediawiki-config] - https://gerrit.wikimedia.org/r/454142
[20:29:15] !log mbsantos@deploy1001 Started deploy [mobileapps/deploy@cae24fe]: Update mobileapps to 95e976d (T202105)
[20:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:22] T202105: Separate pagelib CSS from base CSS - https://phabricator.wikimedia.org/T202105
[20:29:43] Operations, ops-codfw, fundraising-tech-ops, Patch-For-Review: Rack/Setup frbast2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T196417 (Jgreen) a:Jgreen>Papaul @Papaul could you take another look at this host? I'm not having any luck getting it to boot. /admin1-> racadm serve...
[20:30:30] Operations, ops-eqiad, monitoring: rack/setup/install icinga1001.wikimedia.org - https://phabricator.wikimedia.org/T201344 (Dzahn) i would take this once it gets to service implementation
[20:33:50] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (RobH) Please note that I'm now on clinic duty this week, so I need to confirm a few things. This task is c...
[20:34:42] Operations, SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Patrick Earley - https://phabricator.wikimedia.org/T201667 (RobH) Please note this task is currently blocked on @PEarleyWMF logging into their wikitech account to create the ldap entry (...
[20:35:35] davidwbarratt: https://phabricator.wikimedia.org/feed/6591902114377974956/
[20:35:35] !log mbsantos@deploy1001 Finished deploy [mobileapps/deploy@cae24fe]: Update mobileapps to 95e976d (T202105) (duration: 06m 19s)
[20:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:41] T202105: Separate pagelib CSS from base CSS - https://phabricator.wikimedia.org/T202105
[20:36:51] !log Updated Parsoid to 129d71f (T130224, T199926)
[20:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:00] T130224: Parsoid serialises `\w` to a tag (rather than discarding), which then breaks the page - https://phabricator.wikimedia.org/T130224
[20:37:01] T199926: html -> wt: Parsoid sometimes trips up on | chars in hrefs - https://phabricator.wikimedia.org/T199926
[20:37:20] (PS2) Aaron Schulz: Set "cluster" parameter for mcrouter broadcast routing prefixes [mediawiki-config] - https://gerrit.wikimedia.org/r/454142
[20:37:22] (CR) Aaron Schulz: [C: 2] Set "cluster" parameter for mcrouter broadcast routing prefixes [mediawiki-config] -
https://gerrit.wikimedia.org/r/454142 (owner: Aaron Schulz)
[20:38:22] (Merged) jenkins-bot: Set "cluster" parameter for mcrouter broadcast routing prefixes [mediawiki-config] - https://gerrit.wikimedia.org/r/454142 (owner: Aaron Schulz)
[20:40:43] (CR) jenkins-bot: Set "cluster" parameter for mcrouter broadcast routing prefixes [mediawiki-config] - https://gerrit.wikimedia.org/r/454142 (owner: Aaron Schulz)
[20:43:11] !log aaron@deploy1001 Synchronized wmf-config/mc.php: Set "cluster" parameter for mcrouter broadcast routing prefixes (duration: 00m 50s)
[20:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:07] Hauskatze AHHH
[20:53:17] Amir1: hi, could you reinstate davidwbarratt 's Phab account? cf. https://phabricator.wikimedia.org/T202328
[20:53:33] please and thank you!
[20:54:07] hey, I do it rn
[20:54:54] davidwbarratt: you should be enabled now, let me know if it's not working
[20:55:21] I don't see the grey dot before their username so should be fine
[20:55:36] + I've added him to trusted-contributors that should prevent him from being autobanned
[20:56:49] !log fdans@deploy1001 Started deploy [analytics/refinery@f59ce0c]: deploying changes to virtualpageview spam site filtering
[20:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:59:42] Operations, SRE-Access-Requests, User-Addshore: Requesting access to view EventLogging data for Tim WMDE - https://phabricator.wikimedia.org/T202063 (RobH)
[21:00:05] bawolff and Reedy: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180820T2100).
[21:00:49] ^ I like the #bothumor
[21:01:02] #metoo :P
[21:01:06] oh wait
[21:01:09] (PS1) Krinkle: Remove unused config wgWMETrackGeoFeatures (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/454148
[21:01:11] (PS1) Krinkle: Remove unused config wgWMETrackGeoFeatures (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/454149
[21:01:28] Amir1 it works! thanks!
[21:02:57] Operations, Growth-Team, Notifications: SRE query: Is it possible to measure how many e-mails are sent to "black hole" e-mail addresses? - https://phabricator.wikimedia.org/T202329 (Jdforrester-WMF)
[21:04:27] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[21:04:51] (PS1) RobH: adding user Tim WMDE [puppet] - https://gerrit.wikimedia.org/r/454150 (https://phabricator.wikimedia.org/T202063)
[21:07:06] !log fdans@deploy1001 Finished deploy [analytics/refinery@f59ce0c]: deploying changes to virtualpageview spam site filtering (duration: 10m 17s)
[21:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:08:17] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[21:09:47] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[21:10:00] (PS1) RobH: adding tieu to groups [puppet] - https://gerrit.wikimedia.org/r/454153 (https://phabricator.wikimedia.org/T202063)
[21:10:28] RECOVERY - Request latencies on neon is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[21:11:43] Operations, SRE-Access-Requests, Patch-For-Review, User-Addshore: Requesting access to view EventLogging data for Tim WMDE - https://phabricator.wikimedia.org/T202063 (RobH)
[21:26:03] (CR) Dzahn: [C: 2] introduce analytics-tool100[123], assign v4 IPs [dns] - https://gerrit.wikimedia.org/r/454084 (https://phabricator.wikimedia.org/T202013) (owner: Dzahn)
[21:36:16] (CR) Gehel: [C: 1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/453994 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[21:38:42] (CR) Dzahn: [C: 2] "[restbase1007:~] $ host analytics-tool1001.eqiad.wmnet" [dns] - https://gerrit.wikimedia.org/r/454084 (https://phabricator.wikimedia.org/T202013) (owner: Dzahn)
[21:39:19] Operations, Analytics, vm-requests, Patch-For-Review: eqiad: (3) VM %request for internal analytics web sites - https://phabricator.wikimedia.org/T202013 (Dzahn) analytics-tool1001.eqiad.wmnet has address 10.64.32.215 analytics-tool1002.eqiad.wmnet has address 10.64.32.216 analytics-tool1003.eq...
[21:48:41] Operations, Analytics, vm-requests, Patch-For-Review: eqiad: (3) VM %request for internal analytics web sites - https://phabricator.wikimedia.org/T202013 (Dzahn) ``` [ganeti1001:~] $ makevm This is an interactive script to make it easier to create a Ganeti VM. Please see https://wikitech.wikimedi...
[21:49:43] am i not allowed to make HTTPS requests to https://noc.wikimedia.org/ from stat1006? they are timing out for me. (i have a tool which wants to read https://noc.wikimedia.org/conf/dblists/all.dblist and then query each of those databases)
[21:49:49] (this used to work)
[21:53:33] MatmaRex: [stat1006:~] $ https_proxy="http://webproxy.eqiad.wmnet:8080" curl https://noc.wikimedia.org
[21:54:13] okay, now how do i do it from my Ruby code?
;)
[21:54:32] the tool is query-all-dbs.rb in my home dir, if you can see that
[21:54:38] i can probably find out myself. thanks mutante
[21:54:46] https://giphy.com/gifs/reactiongifs-ujUdrdpX7Ok5W
[21:55:50] MatmaRex: i guess export https_proxy="http://webproxy.eqiad.wmnet:8080" ; ruby somefile.rb
[21:56:12] note there is http_proxy and https_proxy
[21:56:20] and gotta set both if using both
[21:59:09] https://stackoverflow.com/questions/6868507/automatically-adding-proxy-to-all-http-connections-in-ruby
[22:02:34] (PS1) Framawiki: Throttle exception for 2018-08-29 [mediawiki-config] - https://gerrit.wikimedia.org/r/454160 (https://phabricator.wikimedia.org/T202288)
[22:05:11] Gerritbot phab bot is disabled ?
[22:05:53] paladox: ^^
[22:07:59] (CR) EBernhardson: [C: 1] Switch entity reference type indexing from opt-in to opt-out [mediawiki-config] - https://gerrit.wikimedia.org/r/452956 (https://phabricator.wikimedia.org/T199884) (owner: Smalyshev)
[22:08:06] (PS2) Framawiki: Throttle exception for 2018-08-29 [mediawiki-config] - https://gerrit.wikimedia.org/r/454160 (https://phabricator.wikimedia.org/T202288)
[22:10:38] Nope
[22:10:40] Since https://phabricator.wikimedia.org/p/gerritbot/
[22:10:46] Shoes it’s posting
[22:12:47] * it is posting
[22:13:26] (CR) EBernhardson: [C: 1] elasticsearch: storage device name changed with new partitioning scheme [puppet] - https://gerrit.wikimedia.org/r/453094 (https://phabricator.wikimedia.org/T198391) (owner: Gehel)
[22:13:43] Operations: Onboarding Cole White - https://phabricator.wikimedia.org/T202136 (fgiunchedi) a:fgiunchedi
[22:13:48] !log ganeti1001 - creating 3 new VMs for analytics-tools
[22:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:44] Operations, Graphite: Include ADD operation in memcached stats and grafana dashboard - https://phabricator.wikimedia.org/T201016 (fgiunchedi) Indeed there is such a dashboard:
https://grafana.wikimedia.org/dashboard/db/mcrouter but it seems partial, it does show ADD though!
[22:18:46] (PS2) Dzahn: mail::mx: use ::profile::base::firewall [puppet] - https://gerrit.wikimedia.org/r/451820
[22:23:00] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/12146/mx1001.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/451820 (owner: Dzahn)
[22:23:45] Operations, Analytics, vm-requests, Patch-For-Review: eqiad: (3) VM %request for internal analytics web sites - https://phabricator.wikimedia.org/T202013 (Dzahn) Created analytics-tool1002 and analytics-tool1003 as above, all the same except 1003 gets the 6G Memory instead of 4. At the end of th...
[22:24:14] Operations, Analytics, vm-requests, Patch-For-Review: eqiad: (3) VM %request for internal analytics web sites - https://phabricator.wikimedia.org/T202013 (Dzahn)
[22:26:43] !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@541932a]: Make the queue sized report stats once a second T202107
[22:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:50] T202107: Job queue should not overload the DB servers when there is replication lag - https://phabricator.wikimedia.org/T202107
[22:27:30] !log ppchelko@deploy1001 Finished deploy [cpjobqueue/deploy@541932a]: Make the queue sized report stats once a second T202107 (duration: 00m 46s)
[22:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:32:28] Operations, DBA, JADE, TechCom-RFC, Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (kaldari)
[22:33:09] Operations, DBA, JADE, TechCom-RFC, Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (kaldari)
[22:33:54] (PS1) Dzahn: install: add MAC addresses for
analytics-tool100[123] to DHCP [puppet] - https://gerrit.wikimedia.org/r/454167 (https://phabricator.wikimedia.org/T202013)
[22:34:16] (PS2) Dzahn: install: add MAC addresses for analytics-tool100[123] to DHCP [puppet] - https://gerrit.wikimedia.org/r/454167 (https://phabricator.wikimedia.org/T202013)
[22:35:27] (CR) Dzahn: [C: 2] install: add MAC addresses for analytics-tool100[123] to DHCP [puppet] - https://gerrit.wikimedia.org/r/454167 (https://phabricator.wikimedia.org/T202013) (owner: Dzahn)
[22:45:48] Operations, DBA, JADE, TechCom-RFC, Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (stjn) A small comment, sorry if I am asking in the wrong place: in the documentation I didn’t see anywher...
[22:49:31] (CR) EBernhardson: [C: 1] search.wikimedia.org should properly handle multivalue separation char (0x1F) [mediawiki-config] - https://gerrit.wikimedia.org/r/446801 (owner: DCausse)
[22:50:12] Operations, DBA, JADE, TechCom-RFC, Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (awight) @stjn Thanks for bringing it up, we do see this kind of abandonment as a possibility. So far, ou...
[22:51:25] (CR) Krinkle: search.wikimedia.org should properly handle multivalue separation char (0x1F) (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/446801 (owner: DCausse)
[22:52:50] !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@c947567]: Use timing for queue sizes as a workaround to grafana not plotting gauges
[22:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:53:36] !log ppchelko@deploy1001 Finished deploy [cpjobqueue/deploy@c947567]: Use timing for queue sizes as a workaround to grafana not plotting gauges (duration: 00m 45s)
[22:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:59:30] (PS1) Dzahn: add analytics-tools1001[123] to site with spare role [puppet] - https://gerrit.wikimedia.org/r/454172 (https://phabricator.wikimedia.org/T202013)
[23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180820T2300).
[23:00:05] rxy: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:09] o/
[23:00:54] Hi.
I can SWAT
[23:01:08] hi
[23:02:22] (CR) 20after4: [C: 2] Allow add or remove interface-admin group by wikidata-staff [mediawiki-config] - https://gerrit.wikimedia.org/r/454083 (https://phabricator.wikimedia.org/T202065) (owner: Rxy)
[23:03:41] (Merged) jenkins-bot: Allow add or remove interface-admin group by wikidata-staff [mediawiki-config] - https://gerrit.wikimedia.org/r/454083 (https://phabricator.wikimedia.org/T202065) (owner: Rxy)
[23:05:11] (PS1) Dzahn: install_server: fix typo in fixed-address of analytics-tool1003 [puppet] - https://gerrit.wikimedia.org/r/454173 (https://phabricator.wikimedia.org/T202013)
[23:05:56] Operations, Traffic, monitoring: False alarms on varnish-http-requests 70% GET drop in 30 min alert - https://phabricator.wikimedia.org/T201630 (fgiunchedi) I believe this alert has fired a few times now and most were false positives, also it is not clear what's the actionable. I went ahead and "soft...
[23:05:57] rxy can you test on mwdebug1001?
[23:06:18] !log https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/454083/ deployed to mwdebug1001
[23:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:07:04] twentyafterfour: ok, work correctly
[23:07:20] (CR) jenkins-bot: Allow add or remove interface-admin group by wikidata-staff [mediawiki-config] - https://gerrit.wikimedia.org/r/454083 (https://phabricator.wikimedia.org/T202065) (owner: Rxy)
[23:08:19] (CR) Dzahn: [C: 2] install_server: fix typo in fixed-address of analytics-tool1003 [puppet] - https://gerrit.wikimedia.org/r/454173 (https://phabricator.wikimedia.org/T202013) (owner: Dzahn)
[23:08:46] !log twentyafterfour@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/454083/ Phab Task: T202065 (duration: 00m 51s)
[23:08:48] (CR) Dzahn: [C: 2] add analytics-tools1001[123] to site with spare role [puppet] - https://gerrit.wikimedia.org/r/454172 (https://phabricator.wikimedia.org/T202013) (owner: Dzahn)
[23:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:08:53] T202065: Interface administrators in Wikidata - https://phabricator.wikimedia.org/T202065
[23:08:56] (PS2) Dzahn: add analytics-tools1001[123] to site with spare role [puppet] - https://gerrit.wikimedia.org/r/454172 (https://phabricator.wikimedia.org/T202013)
[23:09:13] wondered way too long why one VM wouldn't get a DHCP lease.. finally i see it. analytics-too11003.eqiad.wmnet; that's accidental 1337-speak
[23:09:28] too-11003
[23:09:37] hah nice
[23:09:45] !log Finished evening SWAT
[23:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:09:49] twentyafterfour: thanks.
It work correctly at server: mw1252.eqiad.wmnet :)
[23:09:58] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS2914/IPv4: Connect, AS2914/IPv6: Connect
[23:10:07] * twentyafterfour didn't do it ^
[23:10:12] ;)
[23:10:21] rxy: you're welcome!
[23:10:37] XioNoX: ^ is that critical?
[23:11:09] Operations, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (RobH) Ok, sub-task T202345 has the 10G nics, but @aborrero's comment makes me wonder why we're ordering any cloudvirts iwth 10G nics if they are not nee...
[23:11:29] it's important (we lost one of our transit providers) but we have enough capacity so it's not critical
[23:11:44] XioNoX: ok, great, thx
[23:12:37] (PS2) Dzahn: install_server: fix typo in fixed-address of analytics-tool1003 [puppet] - https://gerrit.wikimedia.org/r/454173 (https://phabricator.wikimedia.org/T202013)
[23:12:47] mutante: https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status
[23:12:48] :)
[23:13:05] oh, good point :)
[23:13:58] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 32, down: 0, shutdown: 0
[23:14:02] i'm supposed to come up with a format for standard links from icinga to wikitech
[23:19:19] (PS1) RobH: adding two wmf employees to ldap section [puppet] - https://gerrit.wikimedia.org/r/454174 (https://phabricator.wikimedia.org/T202334)
[23:19:37] (PS2) RobH: adding two wmf employees to ldap section [puppet] - https://gerrit.wikimedia.org/r/454174 (https://phabricator.wikimedia.org/T202334)
[23:19:51] (CR) RobH: [C: 2] adding two wmf employees to ldap section [puppet] - https://gerrit.wikimedia.org/r/454174 (https://phabricator.wikimedia.org/T202334) (owner: RobH)
[23:22:57] Operations, Puppet: Stop introducing new code expanded from erb templates - https://phabricator.wikimedia.org/T200984 (fgiunchedi) >>!
In T200984#4481857, @Volans wrote: > I fully agree with the principle, but I have to admit that I'm also guilty as charged as I've recently add a few lines wrapper bash s...
[23:58:19] Operations, Analytics, vm-requests, Patch-For-Review: eqiad: (3) VM %request for internal analytics web sites - https://phabricator.wikimedia.org/T202013 (Dzahn) @Ottomata Done! 3 new VMs have been created with specs as requested and they have been added to puppet with role(spare::system). Just...
[23:59:47] Operations, Analytics, Analytics-Kanban: Move internal sites hosted on thorium to ganeti instance(s) - https://phabricator.wikimedia.org/T202011 (Dzahn)
[23:59:52] Operations, Analytics, vm-requests, Patch-For-Review: eqiad: (3) VM %request for internal analytics web sites - https://phabricator.wikimedia.org/T202013 (Dzahn) Open>Resolved see the table i pasted in the ticket description for all the details (which instance for which role, IP, names et...