[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Evening SWAT (Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190111T0000). [00:00:04] kostajh: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:16] 10Operations, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, 10Readers-Web-Backlog (Tracking), 10Services (watching): Create Debian packages for Node.js 10 upgrade - https://phabricator.wikimedia.org/T203239 (10Krinkle) [00:00:18] James_F: sure, that would be great. This is our deployment plan fwiw https://www.mediawiki.org/wiki/User:KHarlan_%28WMF%29/Help_Panel_Deployment_Plan [00:00:19] I'll take SWAT. [00:00:38] So 483323 first? [00:01:00] James_F: yes please [00:01:27] (03CR) 10Jforrester: [C: 03+2] "SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483323 (owner: 10Catrope) [00:03:00] (03PS2) 10Jforrester: Revert "Revert "Help panel: Set help desk page correctly on kowiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483323 (owner: 10Catrope) [00:03:06] (03CR) 10Jforrester: [C: 03+2] "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483323 (owner: 10Catrope) [00:03:19] Sorry, didn't spot the merge conflict. :-) [00:03:47] oops, me neither [00:04:35] (03Merged) 10jenkins-bot: Revert "Revert "Help panel: Set help desk page correctly on kowiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483323 (owner: 10Catrope) [00:04:41] James_F: looks like the other two have conflicts too. 
I'll fix them [00:05:37] (03PS3) 10Kosta Harlan: Enable GrowthExperiments help panel on cswiki and kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482373 (https://phabricator.wikimedia.org/T211993) (owner: 10Catrope) [00:06:00] kostajh: 483323 is live on mwdebug1002, and looks good to me. [00:06:06] James_F: ok one sec [00:06:16] (03CR) 10jenkins-bot: Revert "Revert "Help panel: Set help desk page correctly on kowiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483323 (owner: 10Catrope) [00:06:22] kostajh: > var_dump(GrowthExperiments\HelpPanel::getHelpDeskTitle(RequestContext::getMain()->getConfig())); -> "질문방/2019년_1월" [00:07:09] James_F: ok! [00:07:22] Sync to prod? [00:07:27] James_F: yes please [00:08:10] Next is 482373? [00:08:14] James_F: yes [00:08:18] (03CR) 10Jforrester: [C: 03+2] Enable GrowthExperiments help panel on cswiki and kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482373 (https://phabricator.wikimedia.org/T211993) (owner: 10Catrope) [00:08:39] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT Help panel: Set help desk page correctly on kowiki Ia94cfc571 (duration: 00m 46s) [00:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:08:41] James_F: also, I think we kind of abused the wikidatarepo.dblist for the commons deployment [00:08:47] kostajh: Live. [00:08:51] (03PS5) 10Smalyshev: Create second Blazegraph instance for categories [puppet] - 10https://gerrit.wikimedia.org/r/483628 (https://phabricator.wikimedia.org/T213234) [00:08:52] we should probably rename that if it is going to be a generic wikibaserepo dblist [00:09:11] addshore: Oh, oops, did I call it wikidatarepo not wikibaserepo? My mistake. [00:09:31] aaah, you made it! I thought I did! [00:09:34] James_F: cool, logstash is quiet. 
[00:10:06] (03Merged) 10jenkins-bot: Enable GrowthExperiments help panel on cswiki and kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482373 (https://phabricator.wikimedia.org/T211993) (owner: 10Catrope) [00:10:10] (03CR) 10jerkins-bot: [V: 04-1] Create second Blazegraph instance for categories [puppet] - 10https://gerrit.wikimedia.org/r/483628 (https://phabricator.wikimedia.org/T213234) (owner: 10Smalyshev) [00:10:21] James_F: will check AF now :) [00:11:13] (03PS4) 10Kosta Harlan: Enable GrowthExperiments help panel for 50% of new users on cswiki and kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482374 (https://phabricator.wikimedia.org/T211993) (owner: 10Catrope) [00:11:15] (03PS6) 10Smalyshev: Create second Blazegraph instance for categories [puppet] - 10https://gerrit.wikimedia.org/r/483628 (https://phabricator.wikimedia.org/T213234) [00:11:19] kostajh: 482373 is live on mwdebug1002, but I guess you want to check in prod? [00:11:34] James_F: we will check it on mwdebug1002 [00:11:44] (03PS5) 10Dzahn: tor::relay: make Tor family configurable and move to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/459876 [00:11:47] OK, leaving it be there then. [00:11:58] James_F: thanks. It will take us 15-20 minutes, is that ok? [00:12:13] !log 482373 is live on mwdebug1002 for extensive checks. [00:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:20] kostajh: Of course, go ahead. [00:12:41] (03CR) 10jerkins-bot: [V: 04-1] Create second Blazegraph instance for categories [puppet] - 10https://gerrit.wikimedia.org/r/483628 (https://phabricator.wikimedia.org/T213234) (owner: 10Smalyshev) [00:13:02] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) [00:13:14] (03CR) 10Addshore: "It would probably be nicer to keep the wikidatarepos and the commonsrepos in separate db lists." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/482891 (owner: 10Jforrester) [00:13:22] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14280/torrelay1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/459876 (owner: 10Dzahn) [00:13:32] (03PS7) 10Smalyshev: Create second Blazegraph instance for categories [puppet] - 10https://gerrit.wikimedia.org/r/483628 (https://phabricator.wikimedia.org/T213234) [00:14:20] James_F: https://test-commons.wikimedia.org/wiki/Special:AbuseFilter/examine/log/4 looks good to me [00:14:29] thats for the AF stuff :) obviously [00:15:28] James_F: but in the words of tim, remember, https://tools.wmflabs.org/bash/quip/AU7VTzhg6snAnmqnK_pc [00:16:06] Truer words never spoken. [00:16:33] addshore: OK, syncing. [00:16:48] (03CR) 10Dzahn: [C: 03+2] "noop in production" [puppet] - 10https://gerrit.wikimedia.org/r/459876 (owner: 10Dzahn) [00:18:34] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.12/extensions/AbuseFilter/includes/AbuseFilterHooks.php: T213453: Use slot in onEditFilterMergedContent and newVariableHolderForEdit in AbuseFilter (duration: 00m 47s) [00:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:40] T213453: AbuseFilter MCR diff is comparing old value of one slot with the new value from another, not the old whole page with the new whole page - https://phabricator.wikimedia.org/T213453 [00:18:48] (03CR) 10jenkins-bot: Enable GrowthExperiments help panel on cswiki and kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482373 (https://phabricator.wikimedia.org/T211993) (owner: 10Catrope) [00:19:00] (03PS1) 10Dzahn: tor: remove duplicate hiera yaml file [puppet] - 10https://gerrit.wikimedia.org/r/483642 [00:19:23] James_F: let me verify with the next real AF log hit that actually has multiple slots :) [00:19:30] (03CR) 10Dzahn: [C: 03+2] tor: remove duplicate hiera yaml file [puppet] - 10https://gerrit.wikimedia.org/r/483642 
(owner: 10Dzahn) [00:19:38] James_F: is it possible to add one more patch? Otherwise, we'll have to stop the process. [00:19:52] a yet to be created, but small, patch. [00:19:55] kostajh: Of course. [00:20:44] James_F: thanks. I'll have something up very soon [00:22:06] (03CR) 10Jforrester: "Sorry, yes, I meant to call this 'wikibaserepo' and thought-o'ed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482891 (owner: 10Jforrester) [00:25:19] (03PS1) 10Jforrester: Clean-up: Explain why WBMI wikis don't need wmgWikibaseRepoEntityNamespaces set [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483645 [00:26:12] (03PS8) 10Smalyshev: Create second Blazegraph instance for categories [puppet] - 10https://gerrit.wikimedia.org/r/483628 (https://phabricator.wikimedia.org/T213234) [00:27:08] (03CR) 10jerkins-bot: [V: 04-1] Create second Blazegraph instance for categories [puppet] - 10https://gerrit.wikimedia.org/r/483628 (https://phabricator.wikimedia.org/T213234) (owner: 10Smalyshev) [00:27:19] James_F: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/483646, waiting for tests [00:28:24] (03PS9) 10Smalyshev: Create second Blazegraph instance for categories [puppet] - 10https://gerrit.wikimedia.org/r/483628 (https://phabricator.wikimedia.org/T213234) [00:29:40] 10Operations, 10Cloud-VPS, 10Traffic, 10serviceops: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (10BBlack) It's a confusing set of things going on here, and it's going to need fixups on both the `network/data/data.yaml` side a... 
[00:32:29] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [00:33:17] 10Operations, 10TCB-Team, 10WMF-JobQueue, 10Core Platform Team Backlog (Next), 10Services (watching): Grafana alerting broken after upgrade to 5.0.0 - https://phabricator.wikimedia.org/T213506 (10Pchelolo) [00:33:41] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [00:34:43] James_F: I'll create a backport patch or will you? [00:35:10] kostajh: I'll do it. [00:38:16] (03PS10) 10Smalyshev: Create second Blazegraph instance for categories [puppet] - 10https://gerrit.wikimedia.org/r/483628 (https://phabricator.wikimedia.org/T213234) [00:38:18] (03CR) 10Aklapper: [C: 03+1] "Yay!" [puppet] - 10https://gerrit.wikimedia.org/r/483623 (owner: 10Dzahn) [00:43:40] kostajh: Sorry this is taking so long. [00:44:20] James_F: no problem, thank you for accommodating this last minute patch [00:45:44] I'm familiar with the practice myself. :-) [00:48:37] (03PS1) 10Ppchelko: Fix grafana alert check to accomodate new grafana version [puppet] - 10https://gerrit.wikimedia.org/r/483653 (https://phabricator.wikimedia.org/T213506) [00:50:25] !log bump prefix limit for AS6939 in eqsin [00:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:52:01] kostajh: WikimediaEvents change is live on mwdebug1002. [00:52:36] James_F: thanks. We are verifying now [01:00:11] James_F: nearly done verifying. Then there's a final config patch. [01:00:29] OK. :-) [01:01:59] James_F: please proceed [01:02:19] Code first, then config? 
[01:02:25] so that's syncing the WikimediaEvents patch and the GrowthExperiments patch, followed by the config [01:02:33] yes please [01:03:43] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.12/extensions/WikimediaEvents/includes/PageViews.php: SWAT: T213186 GrowthExperiments: Support templates for help desk title (duration: 00m 46s) [01:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:03:46] T213186: Help Panel: Posting to monthly help desk archive (kowiki) - https://phabricator.wikimedia.org/T213186 [01:05:05] (03CR) 10Jforrester: [C: 03+2] Enable GrowthExperiments help panel for 50% of new users on cswiki and kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482374 (https://phabricator.wikimedia.org/T211993) (owner: 10Catrope) [01:05:30] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT T211993 Enable GrowthExperiments help panel on cswiki and kowiki (duration: 00m 45s) [01:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:33] T211993: Help Panel: Start experiment - https://phabricator.wikimedia.org/T211993 [01:06:11] (03Merged) 10jenkins-bot: Enable GrowthExperiments help panel for 50% of new users on cswiki and kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482374 (https://phabricator.wikimedia.org/T211993) (owner: 10Catrope) [01:08:32] kostajh: Sorry, live on mwdebug1002. Need to test? [01:09:02] (03CR) 10jenkins-bot: Enable GrowthExperiments help panel for 50% of new users on cswiki and kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482374 (https://phabricator.wikimedia.org/T211993) (owner: 10Catrope) [01:09:53] James_F: checking with the team [01:10:09] Cool. [01:13:09] James_F: 3-5 more minutes? [01:15:15] Of course. 
[01:15:58] 10Operations, 10TCB-Team, 10WMF-JobQueue, 10Core Platform Team Backlog (Next), and 2 others: Grafana alerting broken after upgrade to 5.0.0 - https://phabricator.wikimedia.org/T213506 (10Pchelolo) [01:18:34] James_F: all good! [01:18:56] OK, syncing. [01:20:07] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT T211993 Enable GrowthExperiments help panel for 50% of new users on cswiki and kowiki (duration: 00m 46s) [01:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:20:11] T211993: Help Panel: Start experiment - https://phabricator.wikimedia.org/T211993 [01:20:22] kostajh: OK, are we done? [01:21:06] James_F: yes, all done. Thanks so much! [01:21:22] Happy to help. Congratulations on getting it out the door. :-) [01:22:32] OK, that's a lid for the week. [02:13:35] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:14:47] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:34:07] 10Operations, 10Core Platform Team Backlog (Watching / External), 10Readers-Web-Backlog (Tracking), 10Services (watching): Create Debian packages for Node.js 10 upgrade - https://phabricator.wikimedia.org/T203239 (10Krinkle) [02:34:23] 10Operations, 10Core Platform Team Backlog (Watching / External), 10Readers-Web-Backlog (Tracking), 10Services (watching): Create Debian packages for Node.js 10 upgrade - https://phabricator.wikimedia.org/T203239 (10Krinkle) 05Open→03Resolved [02:35:11] (03PS2) 10Krinkle: Disable Navigation Timing on closed/private/fishbowl wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481212 [02:35:39] James_F: done deploying I assume? [02:36:04] Yes. 
[02:36:39] (03CR) 10Krinkle: [C: 03+2] Disable Navigation Timing on closed/private/fishbowl wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481212 (owner: 10Krinkle) [02:37:42] (03Merged) 10jenkins-bot: Disable Navigation Timing on closed/private/fishbowl wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481212 (owner: 10Krinkle) [02:38:08] * Krinkle staging on mwdebug1002 [02:39:55] * Krinkle sees mw.loader.moduleRegistry['ext.navigationTiming'] being null on aawiki but still valid on enwiki [02:40:15] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Ib87407165382 (duration: 00m 46s) [02:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:49:08] (03CR) 10jenkins-bot: Disable Navigation Timing on closed/private/fishbowl wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481212 (owner: 10Krinkle) [03:37:45] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 933.29 seconds [04:40:39] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 283.83 seconds [05:22:40] (03CR) 10Krinkle: "As general advise, if you use an editor or other software that for you own usage creates something in every git-repo you work on, it may h" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482501 (owner: 10Zoranzoki21) [05:22:43] (03CR) 10Krinkle: [C: 03+2] .gitignore: Add Visual Studio Code in editors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482501 (owner: 10Zoranzoki21) [05:23:46] (03Merged) 10jenkins-bot: .gitignore: Add Visual Studio Code in editors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482501 (owner: 10Zoranzoki21) [05:23:55] (03CR) 10Zoranzoki21: "> As general advise, if you use an editor or other software that for" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482501 (owner: 10Zoranzoki21) [05:28:00] (03CR) 10Krinkle: [C: 03+2] "OK. 
That is useful I think :) I would like to know how it helps the newcomer? As I understand it, untracked files (files not part of git) " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482501 (owner: 10Zoranzoki21) [05:33:02] (03CR) 10jenkins-bot: .gitignore: Add Visual Studio Code in editors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482501 (owner: 10Zoranzoki21) [05:49:54] (03CR) 10Krinkle: Demistify $wmgMonologChannels Logstash debug level behavior (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483339 (owner: 10Gergő Tisza) [06:31:39] PROBLEM - puppet last run on mw1323 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean] [06:34:39] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/check_confd_template] [06:45:54] 10Operations, 10Cloud-VPS, 10Traffic, 10serviceops: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (10Kelson) [07:00:31] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:02:51] RECOVERY - puppet last run on mw1323 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [07:29:13] 10Operations, 10Puppet, 10Continuous-Integration-Config, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Get rid of "import realm.pp" in manifests/site.pp - https://phabricator.wikimedia.org/T154915 (10hashar) 05Open→03Resolved a:03hashar [07:56:02] (03CR) 10Mathew.onipe: Elasticsearch failed shard allocation check (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [07:56:59] (03PS5) 10Mathew.onipe: Elasticsearch failed shard 
allocation check [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) [08:05:50] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for uwsgi-striker [puppet] - 10https://gerrit.wikimedia.org/r/483114 (https://phabricator.wikimedia.org/T135991) [08:13:04] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T213397 (10noarave) @RStallman-legalteam - NDA has been read and signed. Thank you. [08:13:50] (03PS1) 10Alexandros Kosiaris: zotero: Add release matching to service [deployment-charts] - 10https://gerrit.wikimedia.org/r/483685 [08:21:31] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] zotero: Add release matching to service [deployment-charts] - 10https://gerrit.wikimedia.org/r/483685 (owner: 10Alexandros Kosiaris) [08:43:31] PROBLEM - HHVM rendering on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:44:35] RECOVERY - HHVM rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK - 81211 bytes in 1.159 second response time [08:49:19] !log installing tmpreaper security updates [08:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:39] (03CR) 10Filippo Giunchedi: [C: 04-1] "Thanks Petr!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/483653 (https://phabricator.wikimedia.org/T213506) (owner: 10Ppchelko) [08:59:17] PROBLEM - puppet last run on mw2222 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[tzdata] [09:04:43] (03CR) 10Filippo Giunchedi: [C: 03+1] hiera: add cluster definition to syslog servers [puppet] - 10https://gerrit.wikimedia.org/r/483612 (https://phabricator.wikimedia.org/T210486) (owner: 10Cwhite) [09:05:09] (03CR) 10Filippo Giunchedi: [C: 03+1] hiera: add cluster definition to snapshot servers [puppet] - 10https://gerrit.wikimedia.org/r/483602 (https://phabricator.wikimedia.org/T210486) (owner: 10Cwhite) [09:07:01] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:07:41] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:07:58] (03CR) 10DCausse: [C: 04-1] Elasticsearch failed shard allocation check (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [09:09:21] (03PS1) 10Elukey: systemd::syslog: allow to modify the $local_logdir convention [puppet] - 10https://gerrit.wikimedia.org/r/483691 (https://phabricator.wikimedia.org/T172532) [09:12:18] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/14285/" [puppet] - 10https://gerrit.wikimedia.org/r/483691 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [09:15:19] PROBLEM - puppet last run on proton1002 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 7 minutes ago with 4 failures. Failed resources (up to 3 shown): Package[tzdata],Package[apport],Package[command-not-found],Package[libpng12-0] [09:17:54] (03CR) 10ArielGlenn: "Can we add in dumpsdata1001,2 and francium, and call the cluster 'dumps'? 
The hosts don't all do the same work but this way they are all o" [puppet] - 10https://gerrit.wikimedia.org/r/483602 (https://phabricator.wikimedia.org/T210486) (owner: 10Cwhite) [09:21:45] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:22:25] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:26:10] (03PS2) 10Elukey: systemd::syslog: allow to modify the $local_logdir convention [puppet] - 10https://gerrit.wikimedia.org/r/483691 (https://phabricator.wikimedia.org/T172532) [09:26:32] 10Operations: Prepare our base system layer for Debian buster - https://phabricator.wikimedia.org/T213527 (10MoritzMuehlenhoff) [09:27:41] (03PS1) 10Muehlenhoff: Add support for buster-wikimedia to our internal repository [puppet] - 10https://gerrit.wikimedia.org/r/483694 (https://phabricator.wikimedia.org/T213527) [09:28:33] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/14286/" [puppet] - 10https://gerrit.wikimedia.org/r/483691 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [09:30:31] RECOVERY - puppet last run on mw2222 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:32:05] !log reset iLo on db2053 [09:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:53] (03PS1) 10Muehlenhoff: puppetmasters: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/483695 [09:41:15] RECOVERY - puppet last run on proton1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:42:15] 10Operations, 10serviceops: Canaries canaries canaries - https://phabricator.wikimedia.org/T210143 (10jijiki) [09:43:09] (03PS1) 
10Jcrespo: mariadb: Depool es2013 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483696 [09:44:40] 10Operations, 10Patch-For-Review: Prepare our base system layer for Debian buster - https://phabricator.wikimedia.org/T213527 (10jcrespo) [09:44:42] 10Operations, 10DBA, 10MediaWiki-Database, 10Patch-For-Review: Evaluate and decide the future of relational datastore at WMF after the upgrade of MariaDB 10.1 is finished - https://phabricator.wikimedia.org/T193224 (10jcrespo) [09:44:59] (03CR) 10ArielGlenn: "abogott-scapserver.testlabs.eqiad.wmflabs is jessie and has the deployment_server role, so it's got hhvm on it. I didn't find any other ho" [puppet] - 10https://gerrit.wikimedia.org/r/483381 (owner: 10Muehlenhoff) [09:45:00] 10Operations, 10Patch-For-Review: Prepare our base system layer for Debian buster - https://phabricator.wikimedia.org/T213527 (10jcrespo) [09:46:35] (03CR) 10Jcrespo: [C: 03+2] mariadb: Depool es2013 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483696 (owner: 10Jcrespo) [09:47:45] (03Merged) 10jenkins-bot: mariadb: Depool es2013 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483696 (owner: 10Jcrespo) [09:48:04] (03PS1) 10Elukey: systemd::syslog|timer: add proper handling of ensure [puppet] - 10https://gerrit.wikimedia.org/r/483698 (https://phabricator.wikimedia.org/T172532) [09:48:48] I just did a noop deploy, I will have to repeat it [09:49:19] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: depool es2013 (duration: 00m 47s) [09:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:04] (03CR) 10Gehel: [C: 04-1] "minor comments" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [09:53:08] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/14287/" [puppet] - 10https://gerrit.wikimedia.org/r/483698 
(https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [09:53:10] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: depool es2013 (duration: 00m 45s) [09:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:08] (03PS1) 10Jcrespo: Revert "mariadb: Depool es1019 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483700 [09:54:22] (03PS1) 10Jcrespo: Revert "mariadb: Depool es2013 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483701 [09:58:02] (03CR) 10jenkins-bot: mariadb: Depool es2013 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483696 (owner: 10Jcrespo) [09:58:27] !log upgrade and reboot es2013 [09:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:38] 10Operations, 10serviceops, 10Patch-For-Review, 10User-jijiki: Create a mediawiki::cronjob define - https://phabricator.wikimedia.org/T211250 (10jijiki) [10:27:54] (03CR) 10Jcrespo: [C: 03+2] Revert "mariadb: Depool es2013 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483701 (owner: 10Jcrespo) [10:28:44] (03CR) 10DCausse: [C: 04-1] Elasticsearch failed shard allocation check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [10:30:15] (03Merged) 10jenkins-bot: Revert "mariadb: Depool es2013 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483701 (owner: 10Jcrespo) [10:30:37] !log upgrade and restart es1018 [10:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:55] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: repool es2013 (duration: 00m 45s) [10:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:07] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 
(10MoritzMuehlenhoff) The 4.9.144-1 kernel is fully production-ready, the point releases for Debian are used to rebase the Stretch kernel to the latest set of 4.9.x bug f... [10:35:19] (03CR) 10jenkins-bot: Revert "mariadb: Depool es2013 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483701 (owner: 10Jcrespo) [10:35:52] (03CR) 10Gehel: [C: 03+1] "I'm ok with this as a short time fix, but it looks like cumin needs some more love on the longer term." (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/483613 (https://phabricator.wikimedia.org/T213296) (owner: 10Volans) [10:36:11] (03CR) 10Cparle: [C: 03+1] Clean-up: Explain why WBMI wikis don't need wmgWikibaseRepoEntityNamespaces set [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483645 (owner: 10Jforrester) [10:39:59] (03PS1) 10Elukey: profile::reportupdater::jobs::hadoop: move jobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/483715 (https://phabricator.wikimedia.org/T172532) [10:40:37] (03CR) 10DCausse: [C: 04-1] Elasticsearch failed shard allocation check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [10:41:05] (03PS1) 10Jcrespo: mariadb: Repool es1018 with low load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483717 [10:43:01] (03PS2) 10Elukey: profile::reportupdater::jobs::hadoop: move jobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/483715 (https://phabricator.wikimedia.org/T172532) [10:47:54] (03CR) 10Gehel: [C: 04-1] "looks good, minor comments inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/483529 (https://phabricator.wikimedia.org/T213234) (owner: 10Smalyshev) [10:48:08] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/14289/" [puppet] - 10https://gerrit.wikimedia.org/r/483715 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [10:48:41] (03CR) 10Jcrespo: [C: 03+2] mariadb: Repool 
es1018 with low load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483717 (owner: 10Jcrespo) [10:49:07] (03PS2) 10Jcrespo: Revert "mariadb: Depool es1018 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483403 [10:49:48] (03Merged) 10jenkins-bot: mariadb: Repool es1018 with low load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483717 (owner: 10Jcrespo) [10:51:19] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: repool es1018 with low load (duration: 00m 46s) [10:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:00] (03PS3) 10Jcrespo: Revert "mariadb: Depool es1018 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483403 [10:59:59] (03CR) 10jenkins-bot: mariadb: Repool es1018 with low load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483717 (owner: 10Jcrespo) [11:04:22] !log stop, upgrade and reboot es2016 [11:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:24] (03CR) 10Volans: [C: 03+2] "> Patch Set 1: Code-Review+1" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/483613 (https://phabricator.wikimedia.org/T213296) (owner: 10Volans) [11:15:19] (03Merged) 10jenkins-bot: remote: fix logging for reboot() [software/spicerack] - 10https://gerrit.wikimedia.org/r/483613 (https://phabricator.wikimedia.org/T213296) (owner: 10Volans) [11:15:44] 10Operations, 10CirrusSearch, 10Discovery-Search, 10serviceops: Find an alternative to HHVM curl connection pooling for PHP 7 - https://phabricator.wikimedia.org/T210717 (10Joe) I did some *very* lame benchmarking of the response of the banner url for elasticsearch (`/`), with the following code: ` (03CR) 10jenkins-bot: remote: fix logging for reboot() [software/spicerack] - 10https://gerrit.wikimedia.org/r/483613 (https://phabricator.wikimedia.org/T213296) (owner: 10Volans) [11:18:19] 10Operations, 10Patch-For-Review: Onboarding John Bond - 
https://phabricator.wikimedia.org/T213079 (10MoritzMuehlenhoff) [11:21:29] !log stop, upgrade and reboot es2017 [11:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:43] (03PS3) 10Elukey: Introduce role::analytics_test_cluster [puppet] - 10https://gerrit.wikimedia.org/r/482645 (https://phabricator.wikimedia.org/T212256) [11:25:07] (03PS1) 10Elukey: role::analytics_cluster::hue: remove leftovers [puppet] - 10https://gerrit.wikimedia.org/r/483727 [11:25:20] 10Operations, 10TCB-Team, 10WMF-JobQueue, 10monitoring, and 3 others: Grafana alerting broken after upgrade to 5.0.0 - https://phabricator.wikimedia.org/T213506 (10fgiunchedi) [11:26:08] (03PS1) 10Vgutierrez: certcentral: Set authorized_hosts or regexes for every cert [puppet] - 10https://gerrit.wikimedia.org/r/483728 (https://phabricator.wikimedia.org/T213301) [11:28:00] (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::hue: remove leftovers [puppet] - 10https://gerrit.wikimedia.org/r/483727 (owner: 10Elukey) [11:30:34] (03CR) 10Filippo Giunchedi: [C: 04-1] "Thanks for getting the ball rolling!" 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/483694 (https://phabricator.wikimedia.org/T213527) (owner: 10Muehlenhoff) [11:31:38] (03CR) 10Jcrespo: [C: 03+2] Revert "mariadb: Depool es1018 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483403 (owner: 10Jcrespo) [11:32:43] (03Merged) 10jenkins-bot: Revert "mariadb: Depool es1018 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483403 (owner: 10Jcrespo) [11:34:44] (03CR) 10Filippo Giunchedi: "> Patch Set 29:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/482275 (https://phabricator.wikimedia.org/T212949) (owner: 10Elukey) [11:36:13] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: repool es1018 fully (duration: 00m 46s) [11:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:25] (03CR) 10jenkins-bot: Revert "mariadb: Depool es1018 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483403 (owner: 10Jcrespo) [11:37:39] (03PS18) 10GTirloni: wmcs::nfs::misc - Refactor into profile/role [puppet] - 10https://gerrit.wikimedia.org/r/482051 (https://phabricator.wikimedia.org/T209527) [11:42:52] (03PS1) 10Arturo Borrero Gonzalez: toolforge: refactor docker builder profile [puppet] - 10https://gerrit.wikimedia.org/r/483731 (https://phabricator.wikimedia.org/T213418) [11:42:55] (03PS19) 10GTirloni: wmcs::nfs::misc - Refactor into profile/role [puppet] - 10https://gerrit.wikimedia.org/r/482051 (https://phabricator.wikimedia.org/T209527) [11:44:14] (03CR) 10Elukey: admin: allow users to be deployed without ssh keys configured (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/482275 (https://phabricator.wikimedia.org/T212949) (owner: 10Elukey) [11:49:09] (03CR) 10GTirloni: [C: 03+2] toolforge: refactor docker builder profile [puppet] - 10https://gerrit.wikimedia.org/r/483731 (https://phabricator.wikimedia.org/T213418) (owner: 10Arturo Borrero Gonzalez) [11:56:10] (03PS31) 10Elukey: 
admin: allow users to be deployed without ssh keys configured [puppet] - 10https://gerrit.wikimedia.org/r/482275 (https://phabricator.wikimedia.org/T212949) [12:00:19] (03PS1) 10MacFan4000: Update ext dist settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483735 [12:00:20] (03CR) 10Elukey: "I just noticed something weird in the pcc's output:" [puppet] - 10https://gerrit.wikimedia.org/r/482275 (https://phabricator.wikimedia.org/T212949) (owner: 10Elukey) [12:02:34] (03PS1) 10Tulsi Bhagat: Configure $wgImportSources for ne.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483737 (https://phabricator.wikimedia.org/T213023) [12:03:02] (03PS20) 10GTirloni: wmcs::nfs::misc - Refactor into profile/role [puppet] - 10https://gerrit.wikimedia.org/r/482051 (https://phabricator.wikimedia.org/T209527) [12:03:23] (03PS2) 10MacFan4000: Update ext dist settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483735 [12:04:19] (03CR) 10GTirloni: [C: 03+2] wmcs::nfs::misc - Refactor into profile/role [puppet] - 10https://gerrit.wikimedia.org/r/482051 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [12:11:55] (03PS32) 10Elukey: admin: allow users to be deployed without ssh keys configured [puppet] - 10https://gerrit.wikimedia.org/r/482275 (https://phabricator.wikimedia.org/T212949) [12:16:42] PROBLEM - puppet last run on cloudstore1009 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 11 minutes ago with 3 failures. Failed resources (up to 3 shown): File[/etc/nsswitch.conf],Mount[/srv/scratch],Mount[/srv/maps] [12:18:48] PROBLEM - puppet last run on cloudstore1008 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 4 minutes ago with 3 failures. 
Failed resources (up to 3 shown): File[/etc/nsswitch.conf],Mount[/srv/scratch],Mount[/srv/maps] [12:18:59] (03CR) 10Elukey: "For some reason https://puppet-compiler.wmflabs.org/compiler1002/14293/an-master1001.eqiad.wmnet/ does the right thing (removing the ssh k" [puppet] - 10https://gerrit.wikimedia.org/r/482275 (https://phabricator.wikimedia.org/T212949) (owner: 10Elukey) [12:19:02] (03CR) 10Muehlenhoff: Add support for buster-wikimedia to our internal repository (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/483694 (https://phabricator.wikimedia.org/T213527) (owner: 10Muehlenhoff) [12:20:37] (03PS2) 10Muehlenhoff: Add support for buster-wikimedia to our internal repository [puppet] - 10https://gerrit.wikimedia.org/r/483694 (https://phabricator.wikimedia.org/T213527) [12:22:34] (03PS1) 10Arturo Borrero Gonzalez: toolforge: clush: update references to tools-docker-builder [puppet] - 10https://gerrit.wikimedia.org/r/483739 (https://phabricator.wikimedia.org/T213418) [12:23:46] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: clush: update references to tools-docker-builder [puppet] - 10https://gerrit.wikimedia.org/r/483739 (https://phabricator.wikimedia.org/T213418) (owner: 10Arturo Borrero Gonzalez) [12:27:01] PROBLEM - High load average on cloudstore1008 is CRITICAL: (null) https://grafana.wikimedia.org/dashboard/db/labs-monitoring [12:30:43] 10Operations, 10Patch-For-Review: Onboarding John Bond - https://phabricator.wikimedia.org/T213079 (10MoritzMuehlenhoff) [12:31:35] (03PS3) 10Elukey: systemd::syslog: allow to modify the $local_logdir convention [puppet] - 10https://gerrit.wikimedia.org/r/483691 (https://phabricator.wikimedia.org/T172532) [12:31:37] (03PS2) 10Elukey: systemd::syslog|timer: add proper handling of ensure [puppet] - 10https://gerrit.wikimedia.org/r/483698 (https://phabricator.wikimedia.org/T172532) [12:31:39] (03PS3) 10Elukey: profile::reportupdater::jobs::hadoop: move jobs to systemd timers [puppet] - 
10https://gerrit.wikimedia.org/r/483715 (https://phabricator.wikimedia.org/T172532) [12:32:14] (03CR) 10jerkins-bot: [V: 04-1] systemd::syslog: allow to modify the $local_logdir convention [puppet] - 10https://gerrit.wikimedia.org/r/483691 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [12:32:32] course [12:32:37] (03CR) 10jerkins-bot: [V: 04-1] systemd::syslog|timer: add proper handling of ensure [puppet] - 10https://gerrit.wikimedia.org/r/483698 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [12:35:59] (03PS4) 10Elukey: systemd::syslog: allow to modify the $local_logdir convention [puppet] - 10https://gerrit.wikimedia.org/r/483691 (https://phabricator.wikimedia.org/T172532) [12:36:01] (03PS3) 10Elukey: systemd::syslog|timer: add proper handling of ensure [puppet] - 10https://gerrit.wikimedia.org/r/483698 (https://phabricator.wikimedia.org/T172532) [12:36:03] (03PS4) 10Elukey: profile::reportupdater::jobs::hadoop: move jobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/483715 (https://phabricator.wikimedia.org/T172532) [12:38:10] PROBLEM - High load average on cloudstore1009 is CRITICAL: (null) https://grafana.wikimedia.org/dashboard/db/labs-monitoring [12:51:55] 10Operations, 10netops, 10Performance-Team (Radar): Stop prioritizing peering over transit - https://phabricator.wikimedia.org/T204281 (10Gilles) Following @ayounsi's request, I've put together per-DC real user monitoring performance metrics using the following Hive query: ` SELECT day, SUBSTR(recvfrom, 8,... 
[12:58:04] (03CR) 10Aklapper: [C: 03+1] "@greg: Some general info on https://wikitech.wikimedia.org/wiki/Phabricator#Administrative_Commands ; feel free to reach out if questions" [puppet] - 10https://gerrit.wikimedia.org/r/483623 (owner: 10Dzahn) [13:11:47] (03CR) 10Nuria: "Nice, thanks for doing this work" [puppet] - 10https://gerrit.wikimedia.org/r/483715 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [13:26:08] 10Operations, 10Patch-For-Review: Onboarding John Bond - https://phabricator.wikimedia.org/T213079 (10MoritzMuehlenhoff) 05Open→03Resolved [13:27:35] (03CR) 10Giuseppe Lavagetto: "Some more comments" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/482675 (https://phabricator.wikimedia.org/T210076) (owner: 10Fsero) [13:32:42] (03CR) 10Andrew Bogott: [C: 03+1] "As long as you're sure this isn't used in deployment-prep I'm fine with it." [puppet] - 10https://gerrit.wikimedia.org/r/483381 (owner: 10Muehlenhoff) [13:44:25] (03PS5) 10BBlack: authdns: add NRPE for gdnsd checkconf [puppet] - 10https://gerrit.wikimedia.org/r/483187 [13:44:27] (03PS3) 10BBlack: authdns: reload (replace) gdnsd on config changes [puppet] - 10https://gerrit.wikimedia.org/r/483188 [13:44:29] (03PS2) 10BBlack: authdns: listen for local PROXY, min v6 threads [puppet] - 10https://gerrit.wikimedia.org/r/483470 [13:45:07] (03CR) 10jerkins-bot: [V: 04-1] authdns: add NRPE for gdnsd checkconf [puppet] - 10https://gerrit.wikimedia.org/r/483187 (owner: 10BBlack) [13:45:30] (03CR) 10jerkins-bot: [V: 04-1] authdns: reload (replace) gdnsd on config changes [puppet] - 10https://gerrit.wikimedia.org/r/483188 (owner: 10BBlack) [13:45:53] (03CR) 10jerkins-bot: [V: 04-1] authdns: listen for local PROXY, min v6 threads [puppet] - 10https://gerrit.wikimedia.org/r/483470 (owner: 10BBlack) [13:47:39] (03PS1) 10Arturo Borrero Gonzalez: toolforge: docker builder: missing infrastructure profile [puppet] - 10https://gerrit.wikimedia.org/r/483763 
(https://phabricator.wikimedia.org/T213418) [13:49:10] (03PS2) 10Arturo Borrero Gonzalez: toolforge: docker builder: missing infrastructure profile [puppet] - 10https://gerrit.wikimedia.org/r/483763 (https://phabricator.wikimedia.org/T213418) [13:50:40] (03PS6) 10BBlack: authdns: add NRPE for gdnsd checkconf [puppet] - 10https://gerrit.wikimedia.org/r/483187 [13:50:42] (03PS4) 10BBlack: authdns: reload (replace) gdnsd on config changes [puppet] - 10https://gerrit.wikimedia.org/r/483188 [13:50:44] (03PS3) 10BBlack: authdns: listen for local PROXY, min v6 threads [puppet] - 10https://gerrit.wikimedia.org/r/483470 [13:51:29] (03CR) 10jerkins-bot: [V: 04-1] authdns: add NRPE for gdnsd checkconf [puppet] - 10https://gerrit.wikimedia.org/r/483187 (owner: 10BBlack) [13:51:42] (03CR) 10jerkins-bot: [V: 04-1] authdns: reload (replace) gdnsd on config changes [puppet] - 10https://gerrit.wikimedia.org/r/483188 (owner: 10BBlack) [13:51:47] (03CR) 10jerkins-bot: [V: 04-1] authdns: listen for local PROXY, min v6 threads [puppet] - 10https://gerrit.wikimedia.org/r/483470 (owner: 10BBlack) [13:53:02] (03Abandoned) 10Andrew Bogott: cloudvirt1013: enable alerts [puppet] - 10https://gerrit.wikimedia.org/r/481197 (https://phabricator.wikimedia.org/T212513) (owner: 10Andrew Bogott) [13:53:10] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: docker builder: missing infrastructure profile [puppet] - 10https://gerrit.wikimedia.org/r/483763 (https://phabricator.wikimedia.org/T213418) (owner: 10Arturo Borrero Gonzalez) [13:55:44] (03CR) 10Filippo Giunchedi: [C: 03+1] Add support for buster-wikimedia to our internal repository [puppet] - 10https://gerrit.wikimedia.org/r/483694 (https://phabricator.wikimedia.org/T213527) (owner: 10Muehlenhoff) [13:57:22] (03PS9) 10Gehel: Prepare for multi-instance Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/483529 (https://phabricator.wikimedia.org/T213234) (owner: 10Smalyshev) [13:57:35] (03PS15) 10Fsero: Initial docker::registry::ha 
puppetization. [puppet] - 10https://gerrit.wikimedia.org/r/482675 (https://phabricator.wikimedia.org/T210076) [13:58:25] (03PS7) 10BBlack: authdns: add NRPE for gdnsd checkconf [puppet] - 10https://gerrit.wikimedia.org/r/483187 [13:58:27] (03PS5) 10BBlack: authdns: reload (replace) gdnsd on config changes [puppet] - 10https://gerrit.wikimedia.org/r/483188 [13:58:29] (03PS4) 10BBlack: authdns: listen for local PROXY, min v6 threads [puppet] - 10https://gerrit.wikimedia.org/r/483470 [13:59:08] (03CR) 10jerkins-bot: [V: 04-1] authdns: add NRPE for gdnsd checkconf [puppet] - 10https://gerrit.wikimedia.org/r/483187 (owner: 10BBlack) [13:59:28] (03CR) 10jerkins-bot: [V: 04-1] authdns: reload (replace) gdnsd on config changes [puppet] - 10https://gerrit.wikimedia.org/r/483188 (owner: 10BBlack) [13:59:34] (03CR) 10jerkins-bot: [V: 04-1] authdns: listen for local PROXY, min v6 threads [puppet] - 10https://gerrit.wikimedia.org/r/483470 (owner: 10BBlack) [14:02:22] (03PS16) 10Fsero: Initial docker::registry::ha puppetization. 
[puppet] - 10https://gerrit.wikimedia.org/r/482675 (https://phabricator.wikimedia.org/T210076) [14:05:02] (03PS8) 10BBlack: authdns: add NRPE for gdnsd checkconf [puppet] - 10https://gerrit.wikimedia.org/r/483187 [14:05:04] (03PS6) 10BBlack: authdns: reload (replace) gdnsd on config changes [puppet] - 10https://gerrit.wikimedia.org/r/483188 [14:05:06] (03PS5) 10BBlack: authdns: listen for local PROXY, min v6 threads [puppet] - 10https://gerrit.wikimedia.org/r/483470 [14:05:35] (03CR) 10jerkins-bot: [V: 04-1] authdns: add NRPE for gdnsd checkconf [puppet] - 10https://gerrit.wikimedia.org/r/483187 (owner: 10BBlack) [14:06:03] (03CR) 10jerkins-bot: [V: 04-1] authdns: reload (replace) gdnsd on config changes [puppet] - 10https://gerrit.wikimedia.org/r/483188 (owner: 10BBlack) [14:06:10] (03CR) 10jerkins-bot: [V: 04-1] authdns: listen for local PROXY, min v6 threads [puppet] - 10https://gerrit.wikimedia.org/r/483470 (owner: 10BBlack) [14:08:03] (03PS17) 10Fsero: Initial docker::registry::ha puppetization. 
[puppet] - 10https://gerrit.wikimedia.org/r/482675 (https://phabricator.wikimedia.org/T210076) [14:11:00] (03PS9) 10BBlack: authdns: add NRPE for gdnsd checkconf [puppet] - 10https://gerrit.wikimedia.org/r/483187 [14:11:02] (03PS7) 10BBlack: authdns: reload (replace) gdnsd on config changes [puppet] - 10https://gerrit.wikimedia.org/r/483188 [14:11:04] (03PS6) 10BBlack: authdns: listen for local PROXY, min v6 threads [puppet] - 10https://gerrit.wikimedia.org/r/483470 [14:11:41] (03CR) 10jerkins-bot: [V: 04-1] authdns: add NRPE for gdnsd checkconf [puppet] - 10https://gerrit.wikimedia.org/r/483187 (owner: 10BBlack) [14:12:01] (03CR) 10jerkins-bot: [V: 04-1] authdns: reload (replace) gdnsd on config changes [puppet] - 10https://gerrit.wikimedia.org/r/483188 (owner: 10BBlack) [14:12:08] (03CR) 10jerkins-bot: [V: 04-1] authdns: listen for local PROXY, min v6 threads [puppet] - 10https://gerrit.wikimedia.org/r/483470 (owner: 10BBlack) [14:12:12] (03CR) 10Fsero: "Added more specific datatypes thanks for the review :)" [puppet] - 10https://gerrit.wikimedia.org/r/482675 (https://phabricator.wikimedia.org/T210076) (owner: 10Fsero) [14:12:23] !log updating mariadb client packages on cumin* hosts [14:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:33] (03PS1) 10Arturo Borrero Gonzalez: toolforge: refactor docker registry profile [puppet] - 10https://gerrit.wikimedia.org/r/483765 (https://phabricator.wikimedia.org/T213418) [14:13:14] (03PS10) 10BBlack: authdns: add NRPE for gdnsd checkconf [puppet] - 10https://gerrit.wikimedia.org/r/483187 [14:13:16] (03PS8) 10BBlack: authdns: reload (replace) gdnsd on config changes [puppet] - 10https://gerrit.wikimedia.org/r/483188 [14:13:18] (03PS7) 10BBlack: authdns: listen for local PROXY, min v6 threads [puppet] - 10https://gerrit.wikimedia.org/r/483470 [14:19:16] (03PS10) 10Gehel: Prepare for multi-instance Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/483529 
(https://phabricator.wikimedia.org/T213234) (owner: 10Smalyshev) [14:20:32] (03CR) 10MarcoAurelio: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/483623 (owner: 10Dzahn) [14:20:53] (03CR) 10Gehel: [C: 03+2] Prepare for multi-instance Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/483529 (https://phabricator.wikimedia.org/T213234) (owner: 10Smalyshev) [14:23:57] (03CR) 10Zoranzoki21: [C: 03+1] admins: add Greg to phabricator-admins [puppet] - 10https://gerrit.wikimedia.org/r/483623 (owner: 10Dzahn) [14:24:08] (03CR) 10BBlack: [C: 03+2] authdns: add NRPE for gdnsd checkconf [puppet] - 10https://gerrit.wikimedia.org/r/483187 (owner: 10BBlack) [14:24:17] (03PS11) 10BBlack: authdns: add NRPE for gdnsd checkconf [puppet] - 10https://gerrit.wikimedia.org/r/483187 [14:37:30] !log upgrade and restart db2091 (s2, s4) [14:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:52] (03CR) 10Ottomata: [C: 03+1] "Haven't reviewed all the files here, but I assume it is just like analytics_cluster. +1. Re. analytics-tool, we can just include those r" [puppet] - 10https://gerrit.wikimedia.org/r/482645 (https://phabricator.wikimedia.org/T212256) (owner: 10Elukey) [14:51:17] !log anomie@mwmaint1002 Running migrateActors.php on test wikis and mediawikiwiki for T188327. This may cause lag in codfw. [14:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:20] T188327: Deploy refactored actor storage - https://phabricator.wikimedia.org/T188327 [14:51:44] ^ wrong button, that was yesterday [14:52:55] anomie: Can you run a refresh of the special pages on srwikinews? [14:54:01] Zoranzoki21: Is there a task for it? [14:54:09] anomie: No [14:55:04] anomie: But there is https://phabricator.wikimedia.org/T212346 [14:55:20] I asked because of https://phabricator.wikimedia.org/T212346 [14:58:04] Zoranzoki21: It would probably be best to make a task with the details of what exactly needs to be run. 
Tag it with #Wikimedia-Site-requests. [14:58:17] anomie: Ok, will do it later [15:00:34] (03PS9) 10BBlack: authdns: reload (replace) gdnsd on config changes [puppet] - 10https://gerrit.wikimedia.org/r/483188 [15:00:38] (03PS8) 10BBlack: authdns: listen for local PROXY, min v6 threads [puppet] - 10https://gerrit.wikimedia.org/r/483470 [15:04:41] (03CR) 10BBlack: [C: 03+2] authdns: reload (replace) gdnsd on config changes [puppet] - 10https://gerrit.wikimedia.org/r/483188 (owner: 10BBlack) [15:05:02] (03PS1) 10Hashar: doc: fix redirect of dir lacking a trailing slash [puppet] - 10https://gerrit.wikimedia.org/r/483775 (https://phabricator.wikimedia.org/T95164) [15:07:09] 10Operations: Prepare puppet for Debian buster - https://phabricator.wikimedia.org/T213546 (10MoritzMuehlenhoff) [15:08:10] (03CR) 10BBlack: [C: 03+2] authdns: listen for local PROXY, min v6 threads [puppet] - 10https://gerrit.wikimedia.org/r/483470 (owner: 10BBlack) [15:10:08] (03CR) 10Hashar: "I have manually hacked it on contint1001 since it still has the configuration for doc.wikimedia.org but does not serve any traffic :]" [puppet] - 10https://gerrit.wikimedia.org/r/483775 (https://phabricator.wikimedia.org/T95164) (owner: 10Hashar) [15:12:38] (03PS1) 10BBlack: Revert "authdns: listen for local PROXY, min v6 threads" [puppet] - 10https://gerrit.wikimedia.org/r/483779 [15:13:07] (03CR) 10BBlack: [V: 03+2 C: 03+2] Revert "authdns: listen for local PROXY, min v6 threads" [puppet] - 10https://gerrit.wikimedia.org/r/483779 (owner: 10BBlack) [15:13:38] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Replace Torrus with Prometheus snmp_exporter for PDUs monitoring - https://phabricator.wikimedia.org/T148541 (10mark) p:05Normal→03High Right now I can only find a single graph with eqiad/codfw total (aggregated) power usage, but proper... 
[15:13:52] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Replace Torrus with Prometheus snmp_exporter for PDUs monitoring - https://phabricator.wikimedia.org/T148541 (10mark) 05Declined→03Open [15:17:07] (03CR) 10Hashar: "A potential alternative would be to include the protocol in the ServerName and thus do:" [puppet] - 10https://gerrit.wikimedia.org/r/483775 (https://phabricator.wikimedia.org/T95164) (owner: 10Hashar) [15:21:59] (03PS2) 10Hashar: doc: fix Apache redirects to use https [puppet] - 10https://gerrit.wikimedia.org/r/483775 (https://phabricator.wikimedia.org/T95164) [15:23:29] (03CR) 10Hashar: [V: 03+1] "PS2 is the simpler version. It enables again DirectorySlash and adds https to the ServerName." [puppet] - 10https://gerrit.wikimedia.org/r/483775 (https://phabricator.wikimedia.org/T95164) (owner: 10Hashar) [15:24:29] (03CR) 10Hashar: [V: 03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/483775 (https://phabricator.wikimedia.org/T95164) (owner: 10Hashar) [15:30:44] 10Operations, 10CommRel-Specialists-Support (Jan-Mar-2019), 10User-Johan: Lessons learned: Communicating the server switch 2018 - https://phabricator.wikimedia.org/T206649 (10Elitre) [15:31:06] 10Operations, 10CommRel-Specialists-Support (Jan-Mar-2019), 10Goal, 10User-Johan: Community Relations support for the 2018 data center switchover - https://phabricator.wikimedia.org/T199676 (10Elitre) [15:36:57] (03PS4) 10Muehlenhoff: Switch Thumbor hardening from Firejail to native systemd features (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/482309 (https://phabricator.wikimedia.org/T212941) [15:37:58] (03CR) 10Ottomata: [C: 03+1] mathoid: Move config.yaml into a template (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/483184 (owner: 10Alexandros Kosiaris) [15:41:10] (03PS5) 10Muehlenhoff: Switch Thumbor hardening from Firejail to native systemd features (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/482309 
(https://phabricator.wikimedia.org/T212941) [15:41:34] (03CR) 10Ottomata: [C: 03+1] mathoid: Move config.yaml into a template (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/483184 (owner: 10Alexandros Kosiaris) [15:41:42] (03CR) 10Ottomata: [C: 03+1] mathoid: Move config.yaml into a template (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/483184 (owner: 10Alexandros Kosiaris) [15:45:24] (03PS2) 10Mark Bergsma: Expand Coordinator.resultUp behavior on first monitor check result [debs/pybal] - 10https://gerrit.wikimedia.org/r/478203 [15:45:56] (03CR) 10Mark Bergsma: [C: 03+2] Don't depool pooledDownServers in refreshPreexistingServer [debs/pybal] - 10https://gerrit.wikimedia.org/r/447769 (https://phabricator.wikimedia.org/T184715) (owner: 10Mark Bergsma) [15:47:07] (03Merged) 10jenkins-bot: Don't depool pooledDownServers in refreshPreexistingServer [debs/pybal] - 10https://gerrit.wikimedia.org/r/447769 (https://phabricator.wikimedia.org/T184715) (owner: 10Mark Bergsma) [15:49:10] (03PS1) 10Giuseppe Lavagetto: profile::services_proxy: simple local proxying for remote services [puppet] - 10https://gerrit.wikimedia.org/r/483788 (https://phabricator.wikimedia.org/T210717) [15:49:12] (03PS1) 10Giuseppe Lavagetto: mediawiki::common: add proxy for services [puppet] - 10https://gerrit.wikimedia.org/r/483789 (https://phabricator.wikimedia.org/T210717) [15:49:49] (03CR) 10Mark Bergsma: [C: 03+2] Wait for onConfigUpdate initialization in setServers using inlineCallbacks [debs/pybal] - 10https://gerrit.wikimedia.org/r/477793 (owner: 10Mark Bergsma) [15:49:59] (03CR) 10jerkins-bot: [V: 04-1] profile::services_proxy: simple local proxying for remote services [puppet] - 10https://gerrit.wikimedia.org/r/483788 (https://phabricator.wikimedia.org/T210717) (owner: 10Giuseppe Lavagetto) [15:50:10] 10Operations, 10ops-eqiad, 10DBA: db1115 (tendril DB) had OOM for some processes - https://phabricator.wikimedia.org/T196726 (10jcrespo) a:03Cmjohnson 
Asking @Cmjohnson to move around this memory module- either it got disconnected or broken completely- FYI 128 GB of memory should be detected, but only 96 a... [15:50:31] (03Merged) 10jenkins-bot: Wait for onConfigUpdate initialization in setServers using inlineCallbacks [debs/pybal] - 10https://gerrit.wikimedia.org/r/477793 (owner: 10Mark Bergsma) [15:51:48] (03CR) 10Mark Bergsma: [C: 03+2] Ensure that depool threshold is being honored on new/updated configs [debs/pybal] - 10https://gerrit.wikimedia.org/r/443967 (https://phabricator.wikimedia.org/T184715) (owner: 10Vgutierrez) [15:52:33] (03Merged) 10jenkins-bot: Ensure that depool threshold is being honored on new/updated configs [debs/pybal] - 10https://gerrit.wikimedia.org/r/443967 (https://phabricator.wikimedia.org/T184715) (owner: 10Vgutierrez) [15:52:35] (03CR) 10Gehel: [C: 04-1] "minor comments inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/483628 (https://phabricator.wikimedia.org/T213234) (owner: 10Smalyshev) [15:54:32] 10Operations, 10ops-eqiad, 10DBA: db1115 (tendril DB) had OOM for some processes and some hw (memory) issues - https://phabricator.wikimedia.org/T196726 (10jcrespo) [15:55:44] (03PS11) 10Gehel: Create second Blazegraph instance for categories [puppet] - 10https://gerrit.wikimedia.org/r/483628 (https://phabricator.wikimedia.org/T213234) (owner: 10Smalyshev) [15:55:58] (03CR) 10Gehel: [C: 04-1] Create second Blazegraph instance for categories (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/483628 (https://phabricator.wikimedia.org/T213234) (owner: 10Smalyshev) [15:56:00] (03CR) 10Mark Bergsma: [C: 03+2] Call _updateServerMetrics from _serverInitDone [debs/pybal] - 10https://gerrit.wikimedia.org/r/477794 (owner: 10Mark Bergsma) [15:56:08] (03CR) 10jerkins-bot: [V: 04-1] Create second Blazegraph instance for categories [puppet] - 10https://gerrit.wikimedia.org/r/483628 (https://phabricator.wikimedia.org/T213234) (owner: 10Smalyshev) [15:56:45] (03Merged) 
10jenkins-bot: Call _updateServerMetrics from _serverInitDone [debs/pybal] - 10https://gerrit.wikimedia.org/r/477794 (owner: 10Mark Bergsma) [15:56:47] (03CR) 10Mark Bergsma: "> Patch Set 1:" (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/478203 (owner: 10Mark Bergsma) [15:57:08] 10Operations, 10ops-eqiad, 10DBA: db1115 (tendril DB) had OOM for some processes and some hw (memory) issues - https://phabricator.wikimedia.org/T196726 (10jcrespo) More logs, confirming the module is probably dead: `lines=10 2018-12-16T07:40:35-0600 PR8 **Device not detected: DDR4 DIMM(Socket B2)**... [16:08:37] (03PS1) 10Elukey: [TEST] Remove user elukey [puppet] - 10https://gerrit.wikimedia.org/r/483791 [16:13:00] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 632.03 seconds [16:17:43] (03PS1) 10Andrew Bogott: Horizon: disable 'bastion' and 'redirects' in eqiad region [puppet] - 10https://gerrit.wikimedia.org/r/483792 (https://phabricator.wikimedia.org/T204745) [16:17:45] (03PS1) 10Andrew Bogott: dynamicproxy api: include sqlite3 package [puppet] - 10https://gerrit.wikimedia.org/r/483793 (https://phabricator.wikimedia.org/T213540) [16:17:49] (03PS1) 10Andrew Bogott: Add new nova proxy IPs to some firewall defs [puppet] - 10https://gerrit.wikimedia.org/r/483794 (https://phabricator.wikimedia.org/T213540) [16:18:25] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: disable 'bastion' and 'redirects' in eqiad region [puppet] - 10https://gerrit.wikimedia.org/r/483792 (https://phabricator.wikimedia.org/T204745) (owner: 10Andrew Bogott) [16:19:14] (03CR) 10Andrew Bogott: [C: 03+2] dynamicproxy api: include sqlite3 package [puppet] - 10https://gerrit.wikimedia.org/r/483793 (https://phabricator.wikimedia.org/T213540) (owner: 10Andrew Bogott) [16:20:18] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 270.35 seconds [16:20:27] (03CR) 10Andrew Bogott: [C: 03+2] Add new nova proxy IPs to some 
firewall defs [puppet] - 10https://gerrit.wikimedia.org/r/483794 (https://phabricator.wikimedia.org/T213540) (owner: 10Andrew Bogott) [16:24:40] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [16:24:56] (03PS1) 10Mathew.onipe: maps: migrate maps1003 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/483798 (https://phabricator.wikimedia.org/T198622) [16:25:24] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [16:25:38] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [16:26:22] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [16:26:35] 10Operations, 10ops-eqiad, 10DBA: db1115 (tendril DB) had OOM for some processes and some hw (memory) issues - https://phabricator.wikimedia.org/T196726 (10Marostegui) This host should also be under warranty I guess? [16:27:09] are these ripe-atlas alerts actionable? [16:27:58] o/ [16:28:58] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 1 probes of 407 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [16:29:34] We're writing the onboarding material for the Research team and I have a question about the instructions in https://wikitech.wikimedia.org/wiki/Production_shell_access#Requesting_access . In our case, researchers usually need access to two groups: "researchers" and "analytics-privatedata-users". Are these the same as resources? 
[16:29:42] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 1 probes of 405 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [16:29:54] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 1 probes of 407 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [16:30:22] 10Operations, 10Analytics, 10EventBus, 10Services (watching): Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561 (10Ottomata) p:05Triage→03Normal [16:30:40] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 1 probes of 408 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [16:33:52] leila: yup! replacing “RESOURCE” in the task template with the group names will do the trick [16:34:21] herron: perfect. thanks! [16:36:34] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [16:36:54] (03CR) 10MarcoAurelio: "According to https://puppet-compiler.wmflabs.org/compiler1002/94/phab1001.eqiad.wmnet/ the patch compiles okay. It looks like more than 'p" [puppet] - 10https://gerrit.wikimedia.org/r/483623 (owner: 10Dzahn) [16:36:56] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [16:37:22] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work), 10Patch-For-Review: Fix prometheus elasticsearch exporter to show all the metrics - https://phabricator.wikimedia.org/T210592 (10Gehel) New .deb is available on https://people.wikimedia.org/~gehel/prometheus-elasticsearch-exporter/ @Mathew.... 
[16:38:14] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Marostegui) @RobH I can do the OS installation once you give me the green light for it. [16:40:14] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active [16:40:34] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active [16:40:44] (03PS2) 10Cwhite: hiera: add cluster definition to snapshot servers [puppet] - 10https://gerrit.wikimedia.org/r/483602 (https://phabricator.wikimedia.org/T210486) [16:40:46] (03CR) 10jerkins-bot: [V: 04-1] hiera: add cluster definition to snapshot servers [puppet] - 10https://gerrit.wikimedia.org/r/483602 (https://phabricator.wikimedia.org/T210486) (owner: 10Cwhite) [16:41:34] (03PS2) 10Ppchelko: Fix grafana alert check to accomodate new grafana version [puppet] - 10https://gerrit.wikimedia.org/r/483653 (https://phabricator.wikimedia.org/T213506) [16:43:45] (03CR) 10Ppchelko: "This patch fixes the immediate problem of alerts not working." 
[puppet] - 10https://gerrit.wikimedia.org/r/483653 (https://phabricator.wikimedia.org/T213506) (owner: 10Ppchelko) [16:46:19] (03CR) 10CDanis: [C: 03+1] "I was going to say something about using dashboard IDs instead of names, but then I saw that you're already planning on that :)" [puppet] - 10https://gerrit.wikimedia.org/r/483653 (https://phabricator.wikimedia.org/T213506) (owner: 10Ppchelko) [16:46:46] (03PS1) 10Thcipriani: Add Blubber directory to releases server [puppet] - 10https://gerrit.wikimedia.org/r/483800 (https://phabricator.wikimedia.org/T213563) [16:48:44] (03CR) 10Filippo Giunchedi: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/483653 (https://phabricator.wikimedia.org/T213506) (owner: 10Ppchelko) [16:48:52] (03PS3) 10Filippo Giunchedi: Fix grafana alert check to accomodate new grafana version [puppet] - 10https://gerrit.wikimedia.org/r/483653 (https://phabricator.wikimedia.org/T213506) (owner: 10Ppchelko) [16:48:58] (03PS3) 10Cwhite: hiera: add cluster definition to dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/483602 (https://phabricator.wikimedia.org/T210486) [16:49:59] (03PS4) 10Cwhite: hiera: add cluster definition to dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/483602 (https://phabricator.wikimedia.org/T210486) [16:50:07] (03PS1) 10Marostegui: db-codfw.php: Depool db2060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483801 (https://phabricator.wikimedia.org/T210713) [16:51:30] (03CR) 10Hoo man: [C: 03+2] "Manually tested on sn1008" [dumps/dcat] - 10https://gerrit.wikimedia.org/r/425993 (https://phabricator.wikimedia.org/T154914) (owner: 10Lokal Profil) [16:52:10] (03Merged) 10jenkins-bot: Allow format to be overridden in mediatype object [dumps/dcat] - 10https://gerrit.wikimedia.org/r/425993 (https://phabricator.wikimedia.org/T154914) (owner: 10Lokal Profil) [16:52:41] (03CR) 10Marostegui: [C: 03+2] db-codfw.php: Depool db2060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483801 
(https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [16:53:11] !log Defragment change_tag table on db2060 - T210713 [16:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:14] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [16:53:51] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483801 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [16:55:02] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2060 T210713 (duration: 00m 46s) [16:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:58] (03CR) 10Filippo Giunchedi: [C: 03+1] hiera: add cluster definition to dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/483602 (https://phabricator.wikimedia.org/T210486) (owner: 10Cwhite) [17:00:33] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received [17:00:43] (03CR) 10jenkins-bot: db-codfw.php: Depool db2060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483801 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [17:03:39] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [17:09:41] (03PS2) 10Giuseppe Lavagetto: mediawiki::common: add proxy for services [puppet] - 10https://gerrit.wikimedia.org/r/483789 (https://phabricator.wikimedia.org/T210717) [17:10:40] !log Deploy schema change on db2060 - T210713 [17:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:43] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [17:17:16] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T213397 (10herron) Thanks for the update @noarave 
Standing by for confirmation from @RStallman-legalteam [17:17:21] z/win 10 [17:17:23] ufff [17:23:30] (03CR) 10Gehel: [C: 04-1] Create second Blazegraph instance for categories (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/483628 (https://phabricator.wikimedia.org/T213234) (owner: 10Smalyshev) [17:27:26] (03PS33) 10Elukey: admin: allow users to be deployed without ssh keys configured [puppet] - 10https://gerrit.wikimedia.org/r/482275 (https://phabricator.wikimedia.org/T212949) [17:31:09] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received [17:32:13] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [17:32:21] (03CR) 10Dzahn: "protocol inside ServerName is invalid syntax though.." [puppet] - 10https://gerrit.wikimedia.org/r/483775 (https://phabricator.wikimedia.org/T95164) (owner: 10Hashar) [17:32:50] (03CR) 10Elukey: ">" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/482275 (https://phabricator.wikimedia.org/T212949) (owner: 10Elukey) [17:41:36] (03CR) 10Dzahn: doc: fix Apache redirects to use https (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/483775 (https://phabricator.wikimedia.org/T95164) (owner: 10Hashar) [17:43:32] 10Operations, 10Recommendation-API, 10Research, 10Patch-For-Review, 10Services (watching): Recommendation API improvements - https://phabricator.wikimedia.org/T213222 (10bmansurov) @hashar can you pelase review https://gerrit.wikimedia.org/r/c/integration/config/+/483225 [17:45:08] 10Operations, 10SRE-Access-Requests: add Greg Grossmeier to Phabricator admins group - https://phabricator.wikimedia.org/T213569 (10Dzahn) [17:45:53] 10Operations, 10SRE-Access-Requests: add Greg Grossmeier to Phabricator admins group - https://phabricator.wikimedia.org/T213569 (10Dzahn) p:05Triage→03Normal [17:46:01] 10Operations, 10monitoring, 10Patch-For-Review, 10Performance-Team (Radar): Provision >= 50% of 
statsd/Graphite-only metrics in Prometheus - https://phabricator.wikimedia.org/T205870 (10colewhite) [17:46:20] (03PS3) 10Dzahn: admins: add Greg to phabricator-admins [puppet] - 10https://gerrit.wikimedia.org/r/483623 (https://phabricator.wikimedia.org/T213569) [17:54:19] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [17:55:25] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [18:00:21] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483812 [18:05:46] (03CR) 10Marostegui: [C: 03+2] Revert "db-codfw.php: Depool db2060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483812 (owner: 10Marostegui) [18:06:50] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483812 (owner: 10Marostegui) [18:07:51] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2060 T210713 (duration: 00m 46s) [18:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:55] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [18:10:47] (03CR) 10Smalyshev: Prepare for multi-instance Blazegraph (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/483529 (https://phabricator.wikimedia.org/T213234) (owner: 10Smalyshev) [18:12:37] (03CR) 10Smalyshev: Create second Blazegraph instance for categories (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/483628 (https://phabricator.wikimedia.org/T213234) (owner: 10Smalyshev) [18:15:09] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [18:16:13] hmm maybe zotero is acting up again [18:17:31] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [18:18:34] (03CR) 
10jenkins-bot: Revert "db-codfw.php: Depool db2060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483812 (owner: 10Marostegui) [18:21:05] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received [18:22:11] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [18:24:14] ^ if citoid keeps recovering, could be that someone is adding weird/long/whatever urls [18:28:32] 10Operations, 10Parsoid, 10Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (10Legoktm) [18:28:35] 10Operations, 10Packaging, 10uprightdiff, 10Parsoid-Tests: stretch version of uprightdiff package - https://phabricator.wikimedia.org/T212987 (10Legoktm) 05Open→03Resolved a:03Legoktm uprightdiff is now in stretch-backports: https://tracker.debian.org/news/1019743/accepted-uprightdiff-130-1bpo91-sour... [18:28:47] (03CR) 10Ottomata: [C: 03+1] mathoid: Move config.yaml into a template (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/483184 (owner: 10Alexandros Kosiaris) [18:34:15] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: add Greg Grossmeier to Phabricator admins group - https://phabricator.wikimedia.org/T213569 (10herron) [18:35:43] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: add Greg Grossmeier to Phabricator admins group - https://phabricator.wikimedia.org/T213569 (10herron) Adding the usual checklist, even though it's nearly all done. Since this involves sudo privs it's been flagged for review/approval during the next S... 
[18:37:07] jouncebot: now [18:37:07] No deployments scheduled for the next 63 hour(s) and 52 minute(s) [18:37:40] greg-g: I'm going to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/483735 to update the ExtensionDistributor configuration for the new MW release [18:38:18] legoktm: ack, thanks [18:38:35] (03CR) 10Legoktm: [C: 03+2] "Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483735 (owner: 10MacFan4000) [18:40:38] (03Merged) 10jenkins-bot: Update ext dist settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483735 (owner: 10MacFan4000) [18:43:02] !log legoktm@deploy1001 Synchronized wmf-config/CommonSettings.php: Update ExtensionDistributor for 1.32 release - https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/483735 (duration: 00m 46s) [18:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:30] (03CR) 10jenkins-bot: Update ext dist settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483735 (owner: 10MacFan4000) [18:43:37] woot [18:45:20] (03PS1) 10Ppchelko: Reference grafana dashboards by UID for alerting. [puppet] - 10https://gerrit.wikimedia.org/r/483820 [18:45:55] (03CR) 10jerkins-bot: [V: 04-1] Reference grafana dashboards by UID for alerting. 
[puppet] - 10https://gerrit.wikimedia.org/r/483820 (owner: 10Ppchelko) [18:47:00] (03CR) 10Hashar: [V: 03+1] doc: fix Apache redirects to use https (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/483775 (https://phabricator.wikimedia.org/T95164) (owner: 10Hashar) [18:48:37] (03PS12) 10Smalyshev: Create second Blazegraph instance for categories [puppet] - 10https://gerrit.wikimedia.org/r/483628 (https://phabricator.wikimedia.org/T213234) [18:49:56] (03CR) 10Smalyshev: Prepare for multi-instance Blazegraph (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/483529 (https://phabricator.wikimedia.org/T213234) (owner: 10Smalyshev) [18:50:23] (03PS2) 10Ppchelko: Reference grafana dashboards by UID for alerting. [puppet] - 10https://gerrit.wikimedia.org/r/483820 [18:50:44] I noticed that changeprop requests to ORES dropped dramatically at 18:36, which doesn't coincide with any deployments. Any clues what might have happened? [18:51:13] (03CR) 10jerkins-bot: [V: 04-1] Reference grafana dashboards by UID for alerting. [puppet] - 10https://gerrit.wikimedia.org/r/483820 (owner: 10Ppchelko) [18:54:01] (03PS3) 10Ppchelko: Reference grafana dashboards by UID for alerting. [puppet] - 10https://gerrit.wikimedia.org/r/483820 [18:58:36] Looks like changeprop just recovered. 
[19:18:16] (03CR) 10Smalyshev: Create second Blazegraph instance for categories (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/483628 (https://phabricator.wikimedia.org/T213234) (owner: 10Smalyshev) [19:18:22] (03CR) 10Smalyshev: [C: 03+1] Create second Blazegraph instance for categories [puppet] - 10https://gerrit.wikimedia.org/r/483628 (https://phabricator.wikimedia.org/T213234) (owner: 10Smalyshev) [19:33:58] (03PS3) 10Ottomata: [WIP] Helm chart for eventgate-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) [19:36:57] 10Operations, 10ORES, 10Scoring-platform-team, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Backlog): The continuous release pipeline should support more than one service per repo - https://phabricator.wikimedia.org/T210267 (10thcipriani) I think there are a couple of problems with the curre... [19:39:05] * Nemo_bis misread evangelic-analytic [19:42:46] (03CR) 10Effie Mouzeli: [V: 03+1 C: 03+1] "I tested this on deployment-prep, LGTM. 
We will need a condition to use this systemd hardening only in stretch, and use firejail in jessie" [puppet] - 10https://gerrit.wikimedia.org/r/482309 (https://phabricator.wikimedia.org/T212941) (owner: 10Muehlenhoff) [19:46:14] 10Operations, 10Puppet, 10Packaging: Prepare puppet for Debian buster - https://phabricator.wikimedia.org/T213546 (10herron) p:05Triage→03Normal [19:46:22] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, 10User-jijiki: Investigate systemd hardening to replace Firejail for Thumbor - https://phabricator.wikimedia.org/T212941 (10jijiki) [19:47:11] 10Operations, 10Patch-For-Review: Prepare our base system layer for Debian buster - https://phabricator.wikimedia.org/T213527 (10herron) p:05Triage→03Normal [19:47:45] 10Operations, 10TCB-Team, 10WMF-JobQueue, 10monitoring, and 3 others: Grafana alerting broken after upgrade to 5.0.0 - https://phabricator.wikimedia.org/T213506 (10herron) p:05Triage→03High [19:48:21] 10Operations, 10Developer-Advocacy, 10Gerrit, 10serviceops: Remove port 29418 from cloning process - https://phabricator.wikimedia.org/T37611 (10herron) p:05Triage→03Normal [19:53:03] 10Operations, 10Packaging, 10uprightdiff, 10Parsoid-Tests: stretch version of uprightdiff package - https://phabricator.wikimedia.org/T212987 (10Dzahn) Thanks @Legoktm :) Confirmed it's already installed by puppet on scandium, i did not have to do anything, nice. from: Jan 11 14:55:53 scandium puppet-age... [19:55:53] 10Operations, 10Parsoid, 10Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (10Dzahn) Thanks to Legoktm uploading the package in T212987 and puppet, the uprightdiff package has been installed automatically. from: Jan 11 14:55:53 scandiu... 
[20:36:08] 10Operations, 10ORES, 10Scoring-platform-team, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Backlog): The continuous release pipeline should support more than one service per repo - https://phabricator.wikimedia.org/T210267 (10Ottomata) Seems like it would work, but it doesn't look like this... [20:39:37] 10Operations, 10DBA, 10Performance-Team: Increase parsercache keys TTL from 22 days back to 30 days - https://phabricator.wikimedia.org/T210992 (10Marostegui) So we are close to get the last host ready {T207258}, once that is done I will revert everything back to 30 days. [20:41:57] 10Operations, 10DBA, 10Patch-For-Review, 10User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10Marostegui) We are waiting on Robh to finish the last DCOps steps so we can get these installed and ready by, hopeful... [20:42:40] 10Operations, 10DBA, 10Patch-For-Review: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10Marostegui) 05Stalled→03Open [20:45:16] (03PS1) 10RobH: pc1007 prod dns entries [dns] - 10https://gerrit.wikimedia.org/r/483853 (https://phabricator.wikimedia.org/T208383) [20:46:14] (03CR) 10RobH: [C: 03+2] pc1007 prod dns entries [dns] - 10https://gerrit.wikimedia.org/r/483853 (https://phabricator.wikimedia.org/T208383) (owner: 10RobH) [20:48:05] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10RobH) production dns updated NIC.Embedded.1-1-1 Ethernet = D0:94:66:75:D1:63 [20:50:23] (03PS1) 10Marostegui: install_server: Install pc1007 [puppet] - 10https://gerrit.wikimedia.org/r/483854 (https://phabricator.wikimedia.org/T207258) [20:55:37] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Marostegui) [20:58:17] I have a 
datacenter question: A research client is running a job from Germany, against ores.wikimedia.org. That resolves to an esams LB 91.198.174.192, but I don't understand how that request gets forwarded to an ORES service. ORES is only in codfw and eqiad. [21:02:36] akosiaris: ^ if you happen to know [21:05:56] awight: look at the XFF fields or x-real-ip in the header [21:11:03] chasemp: I'm trying to log into hooft.esams.wikimedia.org and get "unknown host" fwiw [21:16:15] awight: ores.wm.o resolves to the nearest DC roughly with geo aware DNS I expect, and goes through varnish and LVS back to eqiad/codfw the same as regular site traffic. I haven't looked but that's the normal gist. Is the question, why are they hitting ores in esams? [21:17:52] chasemp: Okay thanks for the outline. What I'm trying to debug is that a parallel consumer (20-40 threads) is mostly choked up with TIME_WAIT TCP connections. [21:19:25] Most likely it's caused by our client code, unless you think it might be exacerbated by varnish or the LB? [21:22:17] I wouldn't think so, esp. not at that volume [21:29:30] Cool, I'll just debug our script for now. [21:37:02] !log jforrester@deploy1001 Started scap: Full scap sync to update wmf.12 i18n for the weekend Idf2a67860f [21:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:49] (03CR) 10CDanis: Reference grafana dashboards by UID for alerting.
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/483820 (owner: 10Ppchelko) [21:39:54] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10RobH) a:05RobH→03Marostegui [21:44:17] (03PS1) 10Bstorm: sonofgridengine: control the global and scheduler grid config in puppet [puppet] - 10https://gerrit.wikimedia.org/r/483864 (https://phabricator.wikimedia.org/T213183) [21:45:07] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: control the global and scheduler grid config in puppet [puppet] - 10https://gerrit.wikimedia.org/r/483864 (https://phabricator.wikimedia.org/T213183) (owner: 10Bstorm) [21:50:45] (03PS2) 10Bstorm: sonofgridengine: control the global and scheduler grid config in puppet [puppet] - 10https://gerrit.wikimedia.org/r/483864 (https://phabricator.wikimedia.org/T213183) [21:56:14] !log jforrester@deploy1001 Finished scap: Full scap sync to update wmf.12 i18n for the weekend Idf2a67860f (duration: 19m 12s) [21:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:19] (03CR) 10Bstorm: "So as long as I did the puppet stuff right, this should trivially add puppet control of the global and scheduler config for sonofgridengin" [puppet] - 10https://gerrit.wikimedia.org/r/483864 (https://phabricator.wikimedia.org/T213183) (owner: 10Bstorm) [21:57:24] OK, all done. Let's not have any more deploys. :-) [21:57:54] * Hauskatze serves a cup of Earl Grey tea to James_F [21:58:09] (03CR) 10Bstorm: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/483864 (https://phabricator.wikimedia.org/T213183) (owner: 10Bstorm) [21:58:12] * James_F grins. [21:58:25] maybe another flavour, Sir? [22:01:47] Hauskatze: Lady Grey is rather fine, though my latest favourite is TWG's Silver Moon. [22:03:19] twg = twinnings? 
[22:03:32] (03Abandoned) 10Bstorm: network: Add the new cloud region to all_networks [puppet] - 10https://gerrit.wikimedia.org/r/481215 (https://phabricator.wikimedia.org/T212327) (owner: 10Bstorm) [22:03:47] I stick to English Breakfast, Green and sometimes Prince of Wales [22:03:55] Hauskatze: Nope. https://twgtea.com/Products/Silver-Moon-Tea-1 [22:04:12] (03PS3) 10Bstorm: sonofgridengine: control the global and scheduler grid config in puppet [puppet] - 10https://gerrit.wikimedia.org/r/483864 (https://phabricator.wikimedia.org/T213183) [22:05:33] (03CR) 10Bstorm: [C: 03+2] sonofgridengine: control the global and scheduler grid config in puppet [puppet] - 10https://gerrit.wikimedia.org/r/483864 (https://phabricator.wikimedia.org/T213183) (owner: 10Bstorm) [22:05:37] James_F: 40 bucks for 100 grams of Tea O_o?! [22:06:01] Hauskatze: Singapore prices. But yeah, they're not cheap. [22:07:23] (03PS4) 10Ppchelko: Reference grafana dashboards by UID for alerting. [puppet] - 10https://gerrit.wikimedia.org/r/483820 [22:07:41] 23,01 Pounds Sterling [22:14:56] Lapsang souchong or Rooibos, please. ;) [22:19:29] are we talking about tea? [22:19:34] i have something to share then [22:20:40] so.. Hibiscus is where it's at. like specifically "Superflower Tea" https://www.republicoftea.com/natural-hibiscus-tea/p/v00684/ [22:21:06] but notice how that costs over 10 bucks for some bags with few grams of actual tea [22:22:12] so i looked at the ingredients list and it's only 3. Nigerian Hibiscus (flower), Sweet Blackberry (leaf), and Stevia (leaf). 
So i found all of these in bulk for a fraction of the price and copied the recipe and it tastes identical [22:22:39] you can get that by the pound and use a metal tea egg [22:23:03] btw, the sweet blackberry and stevia make it sweet without using any sugar [22:24:53] also the Nigerian Hibiscus flower is what you get when you order the red "herbal" tea at Starbucks/Pete's [22:25:30] local name "zobo" [22:26:47] https://www.drugs.com/npc/hibiscus.html | https://species.wikimedia.org/wiki/Hibiscus [22:28:32] hah [22:29:30] mutante the alchemist :D [22:30:00] * Hauskatze is testing with logrotate [22:30:16] actually I have to create the logrotate file and I'm reading some docs [22:32:47] i'm trying to get npm and nodejs installed on stretch [22:33:02] to replace the parsoid test box on jessie [22:33:35] also yea, i was talking about "infusions" instead of tea [22:35:43] 10Puppet, 10ORES, 10Scoring-platform-team (Current): orespoolcounter1002.eqiad.wmnet reporting compile errors - https://phabricator.wikimedia.org/T213586 (10Halfak) [22:36:35] 10Puppet, 10ORES, 10Scoring-platform-team (Current): orespoolcounter1002.eqiad.wmnet reporting compile errors - https://phabricator.wikimedia.org/T213586 (10Halfak) Thanks to @MarcoAurelio for reporting. I don't have enough puppet expertise to know what to do with this. But I figure that @akosiaris might k... [23:07:14] (03PS1) 10MarcoAurelio: [WIP] mediawiki: stop logging each run of purge_abusefilter.pp [puppet] - 10https://gerrit.wikimedia.org/r/483876 [23:08:50] (03CR) 10CDanis: Reference grafana dashboards by UID for alerting. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/483820 (owner: 10Ppchelko) [23:12:06] (03PS2) 10MarcoAurelio: [WIP] mediawiki: stop logging each run of purge_abusefilter.pp [puppet] - 10https://gerrit.wikimedia.org/r/483876 [23:12:25] (03CR) 10Ppchelko: Reference grafana dashboards by UID for alerting. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/483820 (owner: 10Ppchelko) [23:12:44] (03PS5) 10Ppchelko: Reference grafana dashboards by UID for alerting. [puppet] - 10https://gerrit.wikimedia.org/r/483820 [23:12:57] (03CR) 10jerkins-bot: [V: 04-1] [WIP] mediawiki: stop logging each run of purge_abusefilter.pp [puppet] - 10https://gerrit.wikimedia.org/r/483876 (owner: 10MarcoAurelio) [23:13:14] damn [23:14:44] (03PS3) 10MarcoAurelio: [WIP] mediawiki: stop logging each run of purge_abusefilter.pp [puppet] - 10https://gerrit.wikimedia.org/r/483876 [23:15:16] Hauskatze: you already know why? it's in the commit message, not the code [23:15:25] yep [23:15:28] 'k [23:15:33] commit message validator [23:16:14] (03CR) 10MarcoAurelio: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/483876 (owner: 10MarcoAurelio) [23:18:10] (03CR) 10MarcoAurelio: "https://puppet-compiler.wmflabs.org/compiler1002/96/" [puppet] - 10https://gerrit.wikimedia.org/r/483876 (owner: 10MarcoAurelio) [23:19:19] (03CR) 10CDanis: [C: 03+1] "This looks good to me, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/483820 (owner: 10Ppchelko) [23:20:13] (03CR) 10Ppchelko: "That was exactly my idea as well :)" [puppet] - 10https://gerrit.wikimedia.org/r/483820 (owner: 10Ppchelko) [23:22:45] 10Puppet, 10AbuseFilter: Check if it is safe to disable logging for purge_abusefilter.pp cron job - https://phabricator.wikimedia.org/T213591 (10MarcoAurelio) [23:23:48] (03PS4) 10MarcoAurelio: [WIP] mediawiki: Stop logging each run of purge_abusefilter.pp [puppet] - 10https://gerrit.wikimedia.org/r/483876 (https://phabricator.wikimedia.org/T213591) [23:32:14] [ 2019-01-11T23:17:17 ] ERROR: Unable to find facts for host , skipping [23:32:40] so mwmaint1002.eqiad.wmnet,mwmaint2001.codfw.wmnet then [23:32:46] no space btw comma [23:47:23] Hauskatze: btw...
check out this https://phabricator.wikimedia.org/T211250 [23:47:36] https://phabricator.wikimedia.org/T211250#4864269 [23:47:54] that is to unify all the maintenance crons [23:48:00] and happening currently [23:48:14] and it's also replacing cron with systemd timers [23:48:56] 10Puppet, 10AbuseFilter, 10Patch-For-Review: Check if it is safe to disable logging for purge_abusefilter.pp cron job - https://phabricator.wikimedia.org/T213591 (10MarcoAurelio) On `deployment-mwmaint01` I've checked `/var/log/mediawiki/puge_abusefilter.log` (last 500 lines) and I didn't see anything wrong.... [23:49:04] mutante: checking [23:50:03] mutante: oh, that looks promising. Certainly logrotate would make it less annoying to keep logs [23:50:09] which can help identify issues [23:50:44] fwiw mutante when I log in to mwmaint it says: [23:50:46] deployment-mwmaint01 is a Mediawiki Maintenance Server: pagetriage extension (mediawiki::maintenance::pagetriage) [23:50:46] deployment-mwmaint01 is a Mediawiki Maintenance Server: parser cache purging (mediawiki::maintenance::parsercachepurging) [23:50:46] deployment-mwmaint01 is a noc.wikimedia.org (noc::site) [23:51:06] but it doesn't list all mediawiki::maintenance:: jobs [23:51:50] I guess we need system::role { 'mediawiki::maintenance::pagetriage': description => 'Mediawiki Maintenance Server: pagetriage extension' } [23:53:04] Hauskatze: yes, system::role is what should generate the motd snippets [23:53:32] and require ::mediawiki::users is also needed? [23:53:33] if we actually want to list all of them..
i don't know [23:53:50] yes, so that they can run as the mediawiki user [23:53:50] probably no [23:54:12] purge_abusefilter.pp doesn't have it and runs [23:54:41] it has user => $::mediawiki::users::web, though [23:56:01] class mediawiki::users( [23:56:01] $web = 'www-data', [23:56:21] that is mediawiki::users and parameter $web [23:56:40] so it runs as www-data unless you provide another one [23:56:44] so [23:56:46] so, apparently Venezuela has either DNS blocked es.wikipedia right now, or there's a flood of traffic that's resetting connections [23:57:00] (from wikipedia-es) [23:58:32] I don't think we want to list all mediawiki::maintenance jobs: https://gerrit.wikimedia.org/g/operations/puppet/+/production/modules/profile/manifests/mediawiki/maintenance.pp