[00:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for Evening SWAT (Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180223T0000).
[00:00:05] No GERRIT patches in the queue for this window AFAICS.
[00:00:49] Actually, I have some stuff for swat that I promised someone
[00:00:57] Krinkle: You go first, I need a few minutes afk first
[00:01:06] k
[00:01:22] (CR) Krinkle: [C: 2] multiversion: Remove support for MW_LANG env override [mediawiki-config] - https://gerrit.wikimedia.org/r/410110 (owner: Krinkle)
[00:01:34] Using mwdebug1002 now
[00:02:15] Ha, ack found a match in docroot/m.wikipedia.org/w/mobilelanding.php
[00:02:51] ... which isn't in git?
[00:02:53] (Merged) jenkins-bot: multiversion: Remove support for MW_LANG env override [mediawiki-config] - https://gerrit.wikimedia.org/r/410110 (owner: Krinkle)
[00:03:05] Hm.. it is
[00:03:15] https://codesearch.wmflabs.org/search/?q=MW_LANG
[00:03:24] GitHub search isn't finding it
[00:03:26] Yay
[00:03:29] Thanks legoktm
[00:03:45] (CR) jenkins-bot: multiversion: Remove support for MW_LANG env override [mediawiki-config] - https://gerrit.wikimedia.org/r/410110 (owner: Krinkle)
[00:03:49] wooo!
[00:04:22] (PS1) Krinkle: Revert "multiversion: Remove support for MW_LANG env override" [mediawiki-config] - https://gerrit.wikimedia.org/r/413641
[00:04:24] (CR) Krinkle: [C: 2] Revert "multiversion: Remove support for MW_LANG env override" [mediawiki-config] - https://gerrit.wikimedia.org/r/413641 (owner: Krinkle)
[00:05:11] Oh crap, the use in mobilelanding.php is horrible.
[00:05:37] (Merged) jenkins-bot: Revert "multiversion: Remove support for MW_LANG env override" [mediawiki-config] - https://gerrit.wikimedia.org/r/413641 (owner: Krinkle)
[00:05:41] I'm glad that's only exposed on *.wikipedia.org
[00:05:52] otherwise it would've been a real mess to clean that up
[00:06:55] (CR) jenkins-bot: Revert "multiversion: Remove support for MW_LANG env override" [mediawiki-config] - https://gerrit.wikimedia.org/r/413641 (owner: Krinkle)
[00:07:40] (PS1) Krinkle: mobilelanding.php: Set wiki context directly instead of MW_LANG indirection [mediawiki-config] - https://gerrit.wikimedia.org/r/413644
[00:07:53] I'll roll this out, but I'll leave the multiversion change for another day.
[00:08:37] legoktm: Hm.. does codesearch index puppet?
[00:08:42] It seems it didn't find https://github.com/wikimedia/puppet/blob/6442ba1b64cd669997950c104e61f153a3d5fcfa/modules/mediawiki/templates/apache/sites/wwwportals.conf.erb#L95
[00:08:46] for 'mobilelanding'
[00:08:47] no, not yet
[00:09:02] there's an issue because the main branch is "production" and it currently hardcodes "master"
[00:09:03] I knew it didn't do ops at first, but I saw it indexing mw-config
[00:09:06] so I kind of assumed..
[00:09:27] there's an upstream PR that's pending last I checked
[00:09:30] (CR) Krinkle: [C: 2] mobilelanding.php: Set wiki context directly instead of MW_LANG indirection [mediawiki-config] - https://gerrit.wikimedia.org/r/413644 (owner: Krinkle)
[00:09:49] legoktm: how far upstream?
[00:10:16] https://github.com/etsy/hound/pull/275
[00:10:57] (Merged) jenkins-bot: mobilelanding.php: Set wiki context directly instead of MW_LANG indirection [mediawiki-config] - https://gerrit.wikimedia.org/r/413644 (owner: Krinkle)
[00:11:10] (CR) jenkins-bot: mobilelanding.php: Set wiki context directly instead of MW_LANG indirection [mediawiki-config] - https://gerrit.wikimedia.org/r/413644 (owner: Krinkle)
[00:15:55] Krinkle: not that I know of, re: renderer decom
[00:26:47] Krinkle: Quote in github btw.
[00:27:10] Er, not even then for your instances
[00:27:19] s/s$//
[00:31:51] testing the patch now
[00:34:59] https://phabricator.wikimedia.org/T69015#3761037
[00:35:01] Useless
[00:35:05] Untestable/unreachable
[00:37:26] !log krinkle@tin Synchronized docroot/m.wikipedia.org/w/mobilelanding.php: Ia54cd736f30808 - rm use of MW_LANG (duration: 01m 13s)
[00:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:28] legoktm: Interesting, I'd assume the default to be HEAD, which git supports on remotes as well, as a way of communicating/avoiding the need to know the default branch
[00:40:43] bblack: Should I create one?
[00:40:44] (CR) Chad: [C: 2] Update logos for the Urdu Wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/410201 (https://phabricator.wikimedia.org/T187182) (owner: Odder)
[00:40:47] (after trying again to find one)
[00:40:50] no_justification: done btw :)
[00:40:55] * Krinkle releases virtual lock
[00:41:15] * no_justification was gonna go anyway :P
[00:41:18] yolo
[00:42:33] Krinkle: I feel like every time we come across T69015 or something related we all recoil in horror
[00:42:34] T69015: m.wikipedia.org and zero.wikipedia.org should redirect how/where - https://phabricator.wikimedia.org/T69015
[00:42:41] And/or surprised?
[00:42:47] It's only been like that for years
[00:44:57] (CR) Chad: [C: 2] Add favicon for right-to-left Wikibooks projects [mediawiki-config] - https://gerrit.wikimedia.org/r/406624 (https://phabricator.wikimedia.org/T185919) (owner: Odder)
[00:48:58] (PS1) Krinkle: multiversion: Remove support for MW_LANG env override (2) [mediawiki-config] - https://gerrit.wikimedia.org/r/413646
[00:52:49] !log demon@tin Synchronized static/images/project-logos/: new project logos for urdu wikt (duration: 01m 13s)
[00:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:55:07] (PS5) Chad: Update logos for the Urdu Wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/410201 (https://phabricator.wikimedia.org/T187182) (owner: Odder)
[00:58:39] (PS3) Chad: Add favicon for right-to-left Wikibooks projects [mediawiki-config] - https://gerrit.wikimedia.org/r/406624 (https://phabricator.wikimedia.org/T185919) (owner: Odder)
[01:00:24] (CR) jenkins-bot: Update logos for the Urdu Wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/410201 (https://phabricator.wikimedia.org/T187182) (owner: Odder)
[01:03:40] (PS4) Chad: Add favicon for right-to-left Wikibooks projects [mediawiki-config] - https://gerrit.wikimedia.org/r/406624 (https://phabricator.wikimedia.org/T185919) (owner: Odder)
[01:06:55] !log demon@tin Synchronized static/favicon/wikibooks-rtl.ico: rtl wikibooks logo (duration: 01m 12s)
[01:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:08:41] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: rtl wikibooks logo (duration: 01m 13s)
[01:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:09:16] (CR) jenkins-bot: Add favicon for right-to-left Wikibooks projects [mediawiki-config] - https://gerrit.wikimedia.org/r/406624 (https://phabricator.wikimedia.org/T185919) (owner: Odder)
[01:09:25] (PS4) Chad: Shrink favicon file sizes [mediawiki-config] - https://gerrit.wikimedia.org/r/402618 (https://phabricator.wikimedia.org/T177726) (owner: Odder)
[01:09:58] (CR) Chad: [C: 2] Shrink favicon file sizes [mediawiki-config] - https://gerrit.wikimedia.org/r/402618 (https://phabricator.wikimedia.org/T177726) (owner: Odder)
[01:10:53] (PS1) Jforrester: 2017 wikitext editor: Simplify config part 1 [mediawiki-config] - https://gerrit.wikimedia.org/r/413651
[01:10:55] (PS1) Jforrester: 2017 wikitext editor: Simplify config part 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/413652
[01:10:57] (PS1) Jforrester: 2017 wikitext editor: Enable by default on officewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/413653 (https://phabricator.wikimedia.org/T188028)
[01:11:33] (Merged) jenkins-bot: Shrink favicon file sizes [mediawiki-config] - https://gerrit.wikimedia.org/r/402618 (https://phabricator.wikimedia.org/T177726) (owner: Odder)
[01:11:59] (CR) jenkins-bot: Shrink favicon file sizes [mediawiki-config] - https://gerrit.wikimedia.org/r/402618 (https://phabricator.wikimedia.org/T177726) (owner: Odder)
[01:13:47] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: point mkwikt favicon to en version, dupe (duration: 01m 15s)
[01:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:14:57] Operations, Commons, Thumbor, Traffic, media-storage: Android unable to render file from upload.wikimedia.org "Error 349 ERR_RESPONSE_HEADERS_MULTIPLE_CONTENT_DISPOSITION" - https://phabricator.wikimedia.org/T170605#3994992 (Krinkle)
[01:15:36] Operations, Commons, Thumbor, Traffic, media-storage: Unable to render file from upload.wikimedia.org "Error 349 ERR_RESPONSE_HEADERS_MULTIPLE_CONTENT_DISPOSITION" - https://phabricator.wikimedia.org/T170605#3436479 (Krinkle)
[01:19:00] !log demon@tin Synchronized static/favicon/: smaller favicons (duration: 01m 12s)
[01:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:24:17] (CR) Jforrester: "Depends on I1cb47a044c9 being everywhere before being deployed." [mediawiki-config] - https://gerrit.wikimedia.org/r/413652 (owner: Jforrester)
[01:31:15] Operations, UniversalLanguageSelector, I18n: MB Lateefi Fonts for Sindhi Wikipedia. - https://phabricator.wikimedia.org/T138136#3995048 (mehtab.ahmed) Font issue has got resolved.
[01:36:29] Operations: Remove imagescaler cluster (aka 'rendering') - https://phabricator.wikimedia.org/T188062#3995055 (Krinkle)
[01:36:39] Operations: Remove imagescaler cluster (aka 'rendering') - https://phabricator.wikimedia.org/T188062#3995065 (Krinkle)
[01:43:27] Operations: Remove imagescaler cluster (aka 'rendering') - https://phabricator.wikimedia.org/T188062#3995070 (Krinkle)
[01:43:47] Operations: Remove imagescaler cluster (aka 'rendering') - https://phabricator.wikimedia.org/T188062#3995055 (Krinkle)
[01:56:15] (PS2) Chad: Move mediawiki.org docroot "from mediawiki" to "mediawiki.org" [mediawiki-config] - https://gerrit.wikimedia.org/r/411368
[01:58:15] (CR) Chad: [C: 2] Move mediawiki.org docroot "from mediawiki" to "mediawiki.org" [mediawiki-config] - https://gerrit.wikimedia.org/r/411368 (owner: Chad)
[01:59:46] (Merged) jenkins-bot: Move mediawiki.org docroot "from mediawiki" to "mediawiki.org" [mediawiki-config] - https://gerrit.wikimedia.org/r/411368 (owner: Chad)
[01:59:56] (CR) jenkins-bot: Move mediawiki.org docroot "from mediawiki" to "mediawiki.org" [mediawiki-config] - https://gerrit.wikimedia.org/r/411368 (owner: Chad)
[02:00:44] (Abandoned) Chad: Move wiktionary and foundationwiki docroots to standard docroot [puppet] - https://gerrit.wikimedia.org/r/402090 (https://phabricator.wikimedia.org/T126306) (owner: Chad)
[02:01:14] (Abandoned) Chad: Drop unused docroots [mediawiki-config] - https://gerrit.wikimedia.org/r/402091 (https://phabricator.wikimedia.org/T126306) (owner: Chad)
[02:05:23] (PS2) Chad: Gerrit: Allow enabling of tls encryption for SMTP [puppet] - https://gerrit.wikimedia.org/r/406145
[02:10:51] !log demon@tin Synchronized docroot/: mw.org docroot moving (duration: 01m 13s)
[02:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:41:11] (PS1) Chad: group1 to wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/413660
[02:42:25] (Restored) Chad: Move wiktionary and foundationwiki docroots to standard docroot [puppet] - https://gerrit.wikimedia.org/r/402090 (https://phabricator.wikimedia.org/T126306) (owner: Chad)
[02:43:02] (CR) Chad: [V: 2 C: 2] Adding webhooks plugin [software/gerrit] - https://gerrit.wikimedia.org/r/409364 (owner: Chad)
[02:46:32] !log demon@tin Started deploy [gerrit/gerrit@23ebf75]: deploying webhooks plugin
[02:46:42] !log demon@tin Finished deploy [gerrit/gerrit@23ebf75]: deploying webhooks plugin (duration: 00m 10s)
[02:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:58:51] (PS1) Chad: Gerrit: Set plugin.webhooks.sslVerify = true [puppet] - https://gerrit.wikimedia.org/r/413661
[03:23:59] Operations, Commons, Thumbor, Traffic, media-storage: Unable to render file from upload.wikimedia.org "Error 349 ERR_RESPONSE_HEADERS_MULTIPLE_CONTENT_DISPOSITION" - https://phabricator.wikimedia.org/T170605#3436479 (BBlack) https://stackoverflow.com/questions/13578428/duplicate-headers-recei...
[03:25:47] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 771.04 seconds
[03:51:27] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1951 bytes in 0.112 second response time
[04:04:57] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 56.79 seconds
[04:07:58] PROBLEM - High CPU load on API appserver on mw1230 is CRITICAL: CRITICAL - load average: 34.55, 32.67, 32.07
[04:11:27] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1955 bytes in 0.098 second response time
[04:17:55] Operations, Continuous-Integration-Infrastructure, MediaWiki-Core-Tests, HHVM: HHVM 3.18.5+dfsg-1+wmf3 changes parse_url causing unit tests to fail - https://phabricator.wikimedia.org/T185024#3903647 (greg) >>! In T185024#3931478, @MoritzMuehlenhoff wrote: > A revised fix has been released (along...
[04:18:27] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1952 bytes in 0.092 second response time
[04:28:28] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1925 bytes in 0.101 second response time
[04:46:27] Operations, LDAP-Access-Requests, WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T187442#3995257 (Dzahn)
[04:50:28] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1954 bytes in 0.106 second response time
[04:53:35] !log ganeti: creating new VM kafkamon1001 - vcpus=2,memory=8g,disk=60G, row_A eqiad (T187901)
[04:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:53:51] T187901: Site: eqiad|codfw VM request for Kafka Burrow Lag monitoring - https://phabricator.wikimedia.org/T187901
[04:56:27] !log ganeti: ganeti2004 - creating new VM kafkamon2001 - vcpus=2,memory=8g,disk=60G, row_A codfw (T187901)
[04:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:53] (PS1) Dzahn: admins: remove duplicate outdated entry for chrisneuroth [puppet] - https://gerrit.wikimedia.org/r/413667
[05:15:27] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1949 bytes in 0.088 second response time
[05:16:56] (PS1) Dzahn: admins: add raz-shuty to ldap_only admins [puppet] - https://gerrit.wikimedia.org/r/413668 (https://phabricator.wikimedia.org/T187442)
[05:19:00] (CR) Dzahn: [C: 2] admins: add raz-shuty to ldap_only admins [puppet] - https://gerrit.wikimedia.org/r/413668 (https://phabricator.wikimedia.org/T187442) (owner: Dzahn)
[05:19:31] (CR) Dzahn: [C: 2] "ldap group: wmde (for gerrit things)" [puppet] - https://gerrit.wikimedia.org/r/413668 (https://phabricator.wikimedia.org/T187442) (owner: Dzahn)
[05:24:07] (PS2) Dzahn: admins: remove duplicate outdated entry for chrisneuroth [puppet] - https://gerrit.wikimedia.org/r/413667
[05:24:47] PROBLEM - puppet last run on cp1052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:25:55] Operations, LDAP-Access-Requests, WMF-NDA-Requests, Patch-For-Review: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T187442#3995280 (Dzahn) >>! In T187442#3990820, @Dzahn wrote: > I'll also start adding the "wmde" users to our "ldap_only" admins group then to avoi...
[05:26:20] Operations, LDAP-Access-Requests, WMF-NDA-Requests, Patch-For-Review: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T187442#3995281 (Dzahn)
[05:26:34] Operations, LDAP-Access-Requests, WMF-NDA-Requests, Patch-For-Review: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T187442#3975346 (Dzahn) Open>Resolved
[05:26:51] Operations, LDAP-Access-Requests, WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T187442#3975346 (Dzahn)
[05:29:32] Operations, vm-requests, Patch-For-Review: Site: eqiad|codfw VM request for Kafka Burrow Lag monitoring - https://phabricator.wikimedia.org/T187901#3995286 (Dzahn) Fri Feb 23 05:18:56 2018 - INFO: - device disk/0: 100.00% done, 0s remaining (estimated) Fri Feb 23 05:18:57 2018 - INFO: Instance kafk...
[05:33:32] (PS1) Dzahn: DHCP: add MACs for kafkamon1001/2001 [puppet] - https://gerrit.wikimedia.org/r/413670 (https://phabricator.wikimedia.org/T187901)
[05:35:34] (CR) Dzahn: [C: 2] "MAC addresses from "gnt-instance info .. | grep MAC" on ganeti1004/ganeti2004 after creating fresh VMs" [puppet] - https://gerrit.wikimedia.org/r/413670 (https://phabricator.wikimedia.org/T187901) (owner: Dzahn)
[05:40:14] !log ganeti1004 - initial startup of kafkamon1001 - booting to PXE, installing stretch (T187901)
[05:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:30] T187901: Site: eqiad|codfw VM request for Kafka Burrow Lag monitoring - https://phabricator.wikimedia.org/T187901
[05:54:47] RECOVERY - puppet last run on cp1052 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[05:58:03] !log puppetmaster1001 - signing puppet certs for kafkamon1001/kafkamon2001 - initial puppet runs, adding as role spare (T187901)
[05:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:17] T187901: Site: eqiad|codfw VM request for Kafka Burrow Lag monitoring - https://phabricator.wikimedia.org/T187901
[06:02:13] (PS1) Dzahn: site: add kafkamon1001/2001 with role test [puppet] - https://gerrit.wikimedia.org/r/413671 (https://phabricator.wikimedia.org/T187805)
[06:04:08] (CR) Dzahn: [C: 2] site: add kafkamon1001/2001 with role test [puppet] - https://gerrit.wikimedia.org/r/413671 (https://phabricator.wikimedia.org/T187805) (owner: Dzahn)
[06:17:19] (PS1) Dzahn: introduce role(kafkamon) and make new VMs use it [puppet] - https://gerrit.wikimedia.org/r/413672 (https://phabricator.wikimedia.org/T187805)
[06:18:39] Operations, ops-codfw, DBA: db2048: RAID with predictive failure - https://phabricator.wikimedia.org/T187983#3995318 (Marostegui) Open>Resolved All good now - thanks Papaul! ``` logicaldrive 1 (3.3 TB, RAID 1+0, OK) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)...
[06:20:15] (PS1) Dzahn: webserver_misc_apps: remove kafka related includes [puppet] - https://gerrit.wikimedia.org/r/413673 (https://phabricator.wikimedia.org/T187805)
[06:23:10] Operations, vm-requests, Patch-For-Review: Site: eqiad|codfw VM request for Kafka Burrow Lag monitoring - https://phabricator.wikimedia.org/T187901#3995321 (Dzahn) Open>Resolved a:Dzahn - created VMs - installed with stretch - signed puppet certs on master, added to site with role(test) -...
[06:26:08] Operations, Patch-For-Review, User-Elukey: Ganeti instances to support Kafka Burrow Consumer lag monitoring - https://phabricator.wikimedia.org/T187805#3995337 (Dzahn) 2 VMs have been created and are up and running. details in subtask linked above. Next see the Gerrit changes above for making a new...
[06:30:29] (CR) Marostegui: [C: 1] mariadb: Pool db2090 for the first time on s4 [mediawiki-config] - https://gerrit.wikimedia.org/r/413439 (https://phabricator.wikimedia.org/T170662) (owner: Jcrespo)
[06:33:35] (PS1) Marostegui: db-eqiad.php: Depool db1090 [mediawiki-config] - https://gerrit.wikimedia.org/r/413674 (https://phabricator.wikimedia.org/T187089)
[06:35:20] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1090 [mediawiki-config] - https://gerrit.wikimedia.org/r/413674 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[06:36:48] (Merged) jenkins-bot: db-eqiad.php: Depool db1090 [mediawiki-config] - https://gerrit.wikimedia.org/r/413674 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[06:36:58] (CR) jenkins-bot: db-eqiad.php: Depool db1090 [mediawiki-config] - https://gerrit.wikimedia.org/r/413674 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[06:37:33] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1947 bytes in 0.087 second response time
[06:38:55] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1090 for alter table (duration: 01m 13s)
[06:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:01] !log Deploy schema change on db1090 - T187089 T185128 T153182
[06:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:16] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089
[06:40:16] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182
[06:40:16] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128
[06:55:17] !log Reboot db2093 to test /srv auto-mounting
[06:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:33] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1938 bytes in 0.123 second response time
[07:32:56] (PS1) Marostegui: db1076: Change the binlog to STATEMENT [puppet] - https://gerrit.wikimedia.org/r/413677 (https://phabricator.wikimedia.org/T186321)
[07:34:28] (CR) Marostegui: [C: 2] db1076: Change the binlog to STATEMENT [puppet] - https://gerrit.wikimedia.org/r/413677 (https://phabricator.wikimedia.org/T186321) (owner: Marostegui)
[07:34:33] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1945 bytes in 0.088 second response time
[07:34:43] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/dumps - 288 bytes in 0.031 second response time
[07:35:01] apergos: ^
[07:35:11] hi, https://phabricator.wikimedia.org/T45952 this is important
[07:51:59] (PS1) Marostegui: db-eqiad.php: Depool db1083 [mediawiki-config] - https://gerrit.wikimedia.org/r/413678 (https://phabricator.wikimedia.org/T162807)
[07:52:58] (PS2) Marostegui: db-eqiad.php: Depool db1083, fully repool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/413678 (https://phabricator.wikimedia.org/T162807)
[07:54:55] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1083, fully repool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/413678 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[07:56:17] (Merged) jenkins-bot: db-eqiad.php: Depool db1083, fully repool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/413678 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[07:56:34] (CR) jenkins-bot: db-eqiad.php: Depool db1083, fully repool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/413678 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[07:58:05] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1083, fully repool db1089 - T162807 (duration: 01m 12s)
[07:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:21] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[08:05:32] !log MariaDB and kernel upgrade on db1083
[08:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:33] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1946 bytes in 0.099 second response time
[08:10:03] (CR) Elukey: "Done thanks! I suppose that after https://gerrit.wikimedia.org/r/#/c/402069/ this shouldn't be a problem anymore right? Maybe I should wai" [puppet] - https://gerrit.wikimedia.org/r/413405 (https://phabricator.wikimedia.org/T184795) (owner: Elukey)
[08:21:33] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1953 bytes in 0.100 second response time
[08:23:56] Operations, hardware-requests: Site: (2) hardware access request for videoscalers - https://phabricator.wikimedia.org/T188075#3995477 (brion)
[08:26:33] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1931 bytes in 0.104 second response time
[08:27:25] Operations, hardware-requests: Site: (2) hardware access request for videoscalers - https://phabricator.wikimedia.org/T188075#3995500 (brion)
[08:27:29] Operations, TimedMediaHandler: Assign 3 more servers to video scaler duty - https://phabricator.wikimedia.org/T114337#3995502 (brion)
[08:32:41] (CR) Joal: [C: 1] "Let's do that :)" (1 comment) [debs/spark2] (debian) - https://gerrit.wikimedia.org/r/405894 (https://phabricator.wikimedia.org/T185581) (owner: Ottomata)
[08:38:13] PROBLEM - High CPU load on API appserver on mw1230 is CRITICAL: CRITICAL - load average: 34.59, 33.10, 32.07
[08:45:38] Operations, vm-requests, Patch-For-Review: Site: eqiad|codfw VM request for Kafka Burrow Lag monitoring - https://phabricator.wikimedia.org/T187901#3989586 (akosiaris) Nice work. Thanks!
[08:49:58] Operations, vm-requests, Patch-For-Review: Site: eqiad|codfw VM request for Kafka Burrow Lag monitoring - https://phabricator.wikimedia.org/T187901#3995527 (elukey) Indeed, thank you!
[08:51:55] (PS1) Elukey: role::configcluster: update zookeeper's ferm rule [puppet] - https://gerrit.wikimedia.org/r/413685 (https://phabricator.wikimedia.org/T187805)
[08:53:30] (CR) Filippo Giunchedi: [C: 1] role::aqs: enable Cassandra JMX exporter [puppet] - https://gerrit.wikimedia.org/r/413405 (https://phabricator.wikimedia.org/T184795) (owner: Elukey)
[08:55:14] PROBLEM - High CPU load on API appserver on mw1230 is CRITICAL: CRITICAL - load average: 40.33, 33.98, 32.39
[09:00:14] (CR) Alexandros Kosiaris: [C: -1] "No, it is still being used by https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/puppetmaster/lib/puppet" [puppet] - https://gerrit.wikimedia.org/r/391336 (owner: Paladox)
[09:00:29] Operations, hardware-requests: Site: (2) hardware access request for videoscalers - https://phabricator.wikimedia.org/T188075#3995550 (brion) General capacity note: current version of libvpx can use a varying number of threads for VP9 encoding depending on the resolution. At our current resolutions, this...
[09:02:14] PROBLEM - High CPU load on API appserver on mw1230 is CRITICAL: CRITICAL - load average: 32.44, 32.16, 32.15
[09:03:39] Operations, Puppet: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562#3888192 (fgiunchedi) I'll take a stab at this, first provisioning a stretch vm on wmcs and applying the relevant roles.
[09:06:23] PROBLEM - High CPU load on API appserver on mw1230 is CRITICAL: CRITICAL - load average: 32.50, 32.00, 32.07
[09:06:53] Operations, Puppet: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562#3995560 (Paladox) When I last used puppetmaster on stretch, one issue is that ruby-mysql is not in stretch. There’s a ruby-mysql2. But I think we can get rid of ruby-mysql now that we ar...
[09:08:05] (PS1) Elukey: [WIP] Introduce new kafka::monitoring::eqiad|codfw roles [puppet] - https://gerrit.wikimedia.org/r/413687
[09:12:00] Operations, Proton, Readers-Web-Backlog, Services (watching): Choose a server for the chromium-render service - https://phabricator.wikimedia.org/T187821#3995565 (akosiaris) >>! In T187821#3990425, @mobrovac wrote: > Given the requirements, I would be inclined to say Kubernetes, but we don't have...
[09:17:23] PROBLEM - High CPU load on API appserver on mw1230 is CRITICAL: CRITICAL - load average: 32.51, 31.89, 32.03
[09:19:23] PROBLEM - High CPU load on API appserver on mw1230 is CRITICAL: CRITICAL - load average: 33.51, 32.20, 32.11
[09:22:23] PROBLEM - High CPU load on API appserver on mw1230 is CRITICAL: CRITICAL - load average: 32.81, 32.25, 32.10
[09:22:36] this time seems different
[09:31:12] (CR) Volans: [C: -1] "Given it's pretty easy to do, let's try to make it backward compatible and avoid to add another moving part do the migration process." [puppet] - https://gerrit.wikimedia.org/r/413435 (https://phabricator.wikimedia.org/T188032) (owner: Herron)
[09:33:53] Operations, Puppet: Port puppetlabs PuppetDB 4.4 package to stretch - https://phabricator.wikimedia.org/T185502#3917661 (fgiunchedi) I noticed while working on puppetmaster on stretch that we didn't have a git repo to host puppetdb source (packages), so I created `operations/debs/puppetdb` for this purpo...
[09:34:46] Operations, ops-codfw, monitoring: db2037 IPMI not working - https://phabricator.wikimedia.org/T188016#3995597 (Volans) The pasted command is without the 'mgmt' part, it seems to work for me adding it: ``` $ sudo ipmitool -I lanplus -H "db2037.mgmt.codfw.wmnet" -U root -E chassis power status Unable...
[09:38:04] Operations, Mathoid, Prod-Kubernetes, Kubernetes, and 3 others: Serve at least 50% of Mathoid via kubernetes - https://phabricator.wikimedia.org/T184919#3995603 (akosiaris) Ah I think we had a small misunderstanding. By chart I meant the overloaded/abused by helm[1] term of https://en.wikipedia.o...
[09:43:06] Operations, ops-codfw, ops-eqiad, monitoring: Unresponsive/misconfigured iDRACs over the host-BMC interface - https://phabricator.wikimedia.org/T169360#3995612 (Marostegui)
[09:43:08] Operations, ops-codfw, monitoring: db2037 IPMI not working - https://phabricator.wikimedia.org/T188016#3995609 (Marostegui) Open>Resolved a:Papaul Indeed - good catch! So looks like the power drain worked fine
[09:44:09] (CR) Jcrespo: [C: 2] mariadb: Pool db2090 for the first time on s4 [mediawiki-config] - https://gerrit.wikimedia.org/r/413439 (https://phabricator.wikimedia.org/T170662) (owner: Jcrespo)
[09:44:11] (PS2) Jcrespo: mariadb: Pool db2090 for the first time on s4 [mediawiki-config] - https://gerrit.wikimedia.org/r/413439 (https://phabricator.wikimedia.org/T170662)
[09:49:36] (PS1) Filippo Giunchedi: puppetmaster: use puppetdb-termini on stretch [puppet] - https://gerrit.wikimedia.org/r/413690 (https://phabricator.wikimedia.org/T184562)
[09:50:28] Operations, Puppet, Patch-For-Review: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562#3995623 (fgiunchedi) >>! In T184562#3995560, @Paladox wrote: > When I last used puppetmaster on stretch, one issue is that ruby-mysql is not in stretch. There’s a r...
[09:50:43] !log restart hhvm on mw1227 [09:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:16] (CR) Jon Harald Søby: Change namespaces on urwiktionary (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/407901 (https://phabricator.wikimedia.org/T186393) (owner: Zoranzoki21) [09:54:09] !log restart hhvm on mw1286 [09:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:45] !log restart hhvm on mw1230 [10:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:23] (PS24) Jon Harald Søby: Add namespaces to urwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/407901 (https://phabricator.wikimedia.org/T186393) (owner: Zoranzoki21) [10:04:13] PROBLEM - Disk space on releases1001 is CRITICAL: DISK CRITICAL - free space: / 4716 MB (3% inode=74%) [10:05:28] (CR) Filippo Giunchedi: "Not strictly needed since puppetdb-termini provides/replaces puppetdb-terminus, but we'll have to rename sooner or later anyway."
[puppet] - https://gerrit.wikimedia.org/r/413690 (https://phabricator.wikimedia.org/T184562) (owner: Filippo Giunchedi) [10:07:19] (CR) jenkins-bot: mariadb: Pool db2090 for the first time on s4 [mediawiki-config] - https://gerrit.wikimedia.org/r/413439 (https://phabricator.wikimedia.org/T170662) (owner: Jcrespo) [10:08:10] !log jynus@tin Synchronized wmf-config/db-codfw.php: Pool db2090 for the first time (duration: 01m 12s) [10:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:33] RECOVERY - High CPU load on API appserver on mw1230 is OK: OK - load average: 4.84, 13.50, 23.33 [10:09:03] (CR) MarcoAurelio: [C: 1] Disable Flow extension on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/408073 (https://phabricator.wikimedia.org/T186463) (owner: Zoranzoki21) [10:09:39] /var/lib/jenkins on releases1001 is ~99G [10:19:01] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Pool db2090 for the first time (duration: 01m 12s) [10:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:14] RECOVERY - Disk space on releases1001 is OK: DISK OK [10:30:00] !log releases1001: sudo -u jenkins rm -fR /var/lib/jenkins/jobs/mediawiki-private-nightlies/workspace/BRANCH/REL1_??/mediawiki-snapshot-REL1_??-2018???? # T188080 [10:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:16] T188080: releases1001 has full / partition - https://phabricator.wikimedia.org/T188080 [10:37:21] (PS1) Vgutierrez: Report coverage stats. Configure flake8 properly. [debs/pybal] - https://gerrit.wikimedia.org/r/413697 [10:37:23] (PS1) Vgutierrez: Make flake8 happy [debs/pybal] - https://gerrit.wikimedia.org/r/413698 [10:38:08] (CR) jerkins-bot: [V: -1] Report coverage stats. Configure flake8 properly. [debs/pybal] - https://gerrit.wikimedia.org/r/413697 (owner: Vgutierrez) [10:55:26] (CR) Giuseppe Lavagetto: Report coverage stats.
Configure flake8 properly. (1 comment) [debs/pybal] - https://gerrit.wikimedia.org/r/413697 (owner: Vgutierrez) [10:56:10] Operations, Pybal, Traffic: Some etcd connections not established at startup - https://phabricator.wikimedia.org/T188087#3995831 (ema) [10:56:35] Operations, Pybal, Traffic: Pybal stuck at BGP state OPENSENT while the other peer reached ESTABLISHED - https://phabricator.wikimedia.org/T188085#3995835 (ema) [10:57:20] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1090" [mediawiki-config] - https://gerrit.wikimedia.org/r/413701 [10:57:23] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1090" [mediawiki-config] - https://gerrit.wikimedia.org/r/413701 [11:00:07] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1090" [mediawiki-config] - https://gerrit.wikimedia.org/r/413701 (owner: Marostegui) [11:01:43] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1090" [mediawiki-config] - https://gerrit.wikimedia.org/r/413701 (owner: Marostegui) [11:02:16] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1090" [mediawiki-config] - https://gerrit.wikimedia.org/r/413701 (owner: Marostegui) [11:02:27] !log installing kernel updates on mw* in codfw [11:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:22] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1090 after alter table (duration: 01m 12s) [11:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:39] (CR) Muehlenhoff: [C: 1] role::configcluster: update zookeeper's ferm rule [puppet] - https://gerrit.wikimedia.org/r/413685 (https://phabricator.wikimedia.org/T187805) (owner: Elukey) [11:08:33] (PS2) Vgutierrez: Report coverage stats. Configure flake8 properly. [debs/pybal] - https://gerrit.wikimedia.org/r/413697 [11:09:30] (CR) jerkins-bot: [V: -1] Report coverage stats. Configure flake8 properly.
[debs/pybal] - https://gerrit.wikimedia.org/r/413697 (owner: Vgutierrez) [11:09:34] (PS2) Elukey: [WIP] Introduce new kafka::monitoring::eqiad|codfw roles [puppet] - https://gerrit.wikimedia.org/r/413687 [11:09:47] (PS3) Vgutierrez: Report coverage stats. Configure flake8 properly. [debs/pybal] - https://gerrit.wikimedia.org/r/413697 [11:10:44] (CR) Vgutierrez: Report coverage stats. Configure flake8 properly. (1 comment) [debs/pybal] - https://gerrit.wikimedia.org/r/413697 (owner: Vgutierrez) [11:10:46] (CR) jerkins-bot: [V: -1] Report coverage stats. Configure flake8 properly. [debs/pybal] - https://gerrit.wikimedia.org/r/413697 (owner: Vgutierrez) [11:11:43] Operations, MediaWiki-Configuration, User-Joe, discovery-system: Test EtcdConfig in different failure scenarios - https://phabricator.wikimedia.org/T185078#3995873 (Joe) I dug into the code a bit and turns out my testing strategy was flawed: since the cache key **depends on the hostname**, by cha... [11:11:52] Operations, Traffic: VCL discards crash varnish frontend child process - https://phabricator.wikimedia.org/T188089#3995874 (ema) p:Triage>High [11:12:34] Operations, UniversalLanguageSelector, I18n: MB Lateefi Fonts for Sindhi Wikipedia. - https://phabricator.wikimedia.org/T138136#3995879 (Aklapper) @mehtab.ahmed: Which "font issue" exactly, how and where? Also why did you remove me as a subscriber on this task?
[11:12:38] (CR) Ema: [C: -2] "Do not merge: https://phabricator.wikimedia.org/T188089" [puppet] - https://gerrit.wikimedia.org/r/412737 (owner: Ema) [11:23:54] (PS1) Marostegui: db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/413703 (https://phabricator.wikimedia.org/T186321) [11:24:21] (PS4) Jcrespo: mariadb: Fix and standarize firewall holes to all cloud-related mariadbs [puppet] - https://gerrit.wikimedia.org/r/413375 (https://phabricator.wikimedia.org/T184704) [11:25:49] (CR) Elukey: "I don't love the idea of having ::eqiad/::codfw in role names, so if anybody has a better idea please let me know. For example, burrow ana" [puppet] - https://gerrit.wikimedia.org/r/413687 (owner: Elukey) [11:25:53] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/413703 (https://phabricator.wikimedia.org/T186321) (owner: Marostegui) [11:27:16] (Merged) jenkins-bot: db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/413703 (https://phabricator.wikimedia.org/T186321) (owner: Marostegui) [11:27:32] (CR) jenkins-bot: db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/413703 (https://phabricator.wikimedia.org/T186321) (owner: Marostegui) [11:28:54] (CR) Jcrespo: "the eqiad and codfw one only seem to differ on the description, that can be gotten from puppet facts ?"
[puppet] - https://gerrit.wikimedia.org/r/413687 (owner: Elukey) [11:28:59] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1076 for binlog format change - T186321 (duration: 01m 08s) [11:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:13] T186321: Prepare and indicate proper master db failover candidates for all database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321 [11:29:17] !log Restart mariadb on db1076 for binlog format change - T186321 [11:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:03] PROBLEM - puppet last run on mw2107 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[libgcc1-dbg] [11:32:25] (CR) Jcrespo: "Not sure if facts or variables, as they are similar. But even if they were for some reason unable to be used there is also $(cat /etc/wiki" [puppet] - https://gerrit.wikimedia.org/r/413687 (owner: Elukey) [11:33:41] (PS1) Marostegui: db-eqiad.php: Slowly repool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/413704 [11:34:33] (CR) Elukey: "> the eqiad and codfw one only seem to differ on the description," [puppet] - https://gerrit.wikimedia.org/r/413687 (owner: Elukey) [11:35:40] (CR) Marostegui: [C: 2] db-eqiad.php: Slowly repool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/413704 (owner: Marostegui) [11:37:08] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/413704 (owner: Marostegui) [11:37:41] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/413704 (owner: Marostegui) [11:37:53] PROBLEM - puppet last run on mw2138 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures.
Failed resources (up to 3 shown): Package[libgcc1-dbg] [11:38:56] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1076 - T186321 (duration: 01m 13s) [11:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:10] T186321: Prepare and indicate proper master db failover candidates for all eqiad database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321 [11:39:24] (PS5) Jcrespo: mariadb: Fix and standarize firewall holes to all cloud-related mariadbs [puppet] - https://gerrit.wikimedia.org/r/413375 (https://phabricator.wikimedia.org/T184704) [11:40:12] (CR) Jcrespo: [C: 1] "I think this is ready to be deployed: https://puppet-compiler.wmflabs.org/compiler02/10120/" [puppet] - https://gerrit.wikimedia.org/r/413375 (https://phabricator.wikimedia.org/T184704) (owner: Jcrespo) [11:44:16] Operations, UniversalLanguageSelector, I18n: MB Lateefi Fonts for Sindhi Wikipedia. - https://phabricator.wikimedia.org/T138136#3995958 (mehtab.ahmed) @Aklapper: few days ago @BukhariSaeed asked me to update https ://sd.wikipedia.org/wiki/%D8%B0%D8%B1%D9%8A%D8%B9%D8%A7%D8%AA_%D9%88%DA%AA%D9%8A:Common... [11:44:35] Operations, Patch-For-Review, User-Elukey: Ganeti instances to support Kafka Burrow Consumer lag monitoring - https://phabricator.wikimedia.org/T187805#3995961 (elukey) @Dzahn I started https://gerrit.wikimedia.org/r/#/c/413687/ because I have some constraints to apply for kafkamon hosts, not sure if... [11:45:03] PROBLEM - puppet last run on mw2235 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures.
Failed resources (up to 3 shown): Package[libgcc1-dbg] [12:00:03] RECOVERY - puppet last run on mw2107 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:02:13] (PS1) Marostegui: prometheus: Add db1115 and db2093 [puppet] - https://gerrit.wikimedia.org/r/413709 (https://phabricator.wikimedia.org/T184704) [12:04:56] (PS1) Marostegui: db-eqiad.php: Increase traffic for db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/413711 [12:05:54] (CR) Marostegui: [C: 2] prometheus: Add db1115 and db2093 [puppet] - https://gerrit.wikimedia.org/r/413709 (https://phabricator.wikimedia.org/T184704) (owner: Marostegui) [12:06:18] (PS1) Jcrespo: tendril: Revert all db optimizations except the default table engine [puppet] - https://gerrit.wikimedia.org/r/413712 (https://phabricator.wikimedia.org/T184704) [12:06:33] (CR) Marostegui: [C: 2] db-eqiad.php: Increase traffic for db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/413711 (owner: Marostegui) [12:07:02] Operations, Proton, Readers-Web-Backlog, Services (watching): Choose a server for the chromium-render service - https://phabricator.wikimedia.org/T187821#3996012 (phuedx) >>! In T187821#3994402, @Niedzielski wrote: > 1. Install Proton to a virtual machine. This isn't out of the question, as @pmi...
[12:07:36] (PS1) Urbanecm: Define new throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/413713 [12:07:44] (Merged) jenkins-bot: db-eqiad.php: Increase traffic for db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/413711 (owner: Marostegui) [12:07:53] RECOVERY - puppet last run on mw2138 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:07:54] (CR) jenkins-bot: db-eqiad.php: Increase traffic for db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/413711 (owner: Marostegui) [12:08:32] (PS2) Urbanecm: Define new throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/413713 (https://phabricator.wikimedia.org/T188090) [12:08:55] (CR) Marostegui: [C: 1] tendril: Revert all db optimizations except the default table engine [puppet] - https://gerrit.wikimedia.org/r/413712 (https://phabricator.wikimedia.org/T184704) (owner: Jcrespo) [12:09:14] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1076 - T186321 (duration: 01m 12s) [12:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:29] T186321: Prepare and indicate proper master db failover candidates for all eqiad database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321 [12:09:36] Hi, anybody around to deploy 413713 / T188090 ? zeljkof, twentyafterfour, hashar, marostegui?
[12:09:36] T188090: Request to throttle account creation limit for Hong Kong Tutorial - https://phabricator.wikimedia.org/T188090 [12:11:38] (PS1) Marostegui: db-eqiad.php: Slowly repool db1083 [mediawiki-config] - https://gerrit.wikimedia.org/r/413714 [12:12:31] no_justification ^^ [12:12:45] dereckson, ^^ [12:13:25] Urbanecm: I normally only touch db-eqiad and db-codfw so I would prefer if someone from releng take care of your change :) [12:13:38] (CR) Marostegui: [C: 2] db-eqiad.php: Slowly repool db1083 [mediawiki-config] - https://gerrit.wikimedia.org/r/413714 (owner: Marostegui) [12:14:04] marostegui, I know, but I'm seeing you're around in this time and certainly have deploy privs and technically can deploy it... [12:14:40] Urbanecm: I hashar and zeljkof might be around too [12:14:51] (PS2) Jcrespo: tendril: Revert all db optimizations except the default table engine [puppet] - https://gerrit.wikimedia.org/r/413712 (https://phabricator.wikimedia.org/T184704) [12:15:03] RECOVERY - puppet last run on mw2235 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:15:03] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1083 [mediawiki-config] - https://gerrit.wikimedia.org/r/413714 (owner: Marostegui) [12:15:26] marostegui, pinged them already :) [12:15:44] Urbanecm: sure which gerrit change?
:) [12:15:56] hashar, 413713 [12:16:17] Full link: https://gerrit.wikimedia.org/r/413713 [12:16:39] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1083 (duration: 01m 21s) [12:16:46] Urbanecm: will do in a few [12:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:00] thank you [12:17:00] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1083 [mediawiki-config] - https://gerrit.wikimedia.org/r/413714 (owner: Marostegui) [12:17:21] I find those last minute throttle requests abusive [12:18:39] Hauskatze, they should be somehow warned way before events, maybe some kind of preventive e-mail to...all chapters? [12:19:27] let me clarify Urbanecm -- it's obviosly not abusive from you [12:19:53] but if we have deployment windows and guides asking people to warn us days in advance... [12:20:29] I understood it :). Just thinking about how to prevent such last minute things... Maybe too brutal, but what about declining all such requests even we can try to do it? [12:20:33] in any case, hopefully one day this kind of requests could be manageable on-wiki [12:20:45] You mean throttleoverride probably... [12:21:13] ...which is waiting on deployment 7 years already [12:21:14] I'd certainly decline them, or at least warn them that this is the first-and-the-last time [12:21:20] yep, that one [12:22:02] "Las cosas de Palacio van despacio" its a say that probably apply in that case (means: Palace things go slowly, probably doesn't make sense in any other language). [12:22:06] marostegui: let me know when you are done with the mediawiki-config db change. I will deploy the throttle rule Urbanecm asked for [12:22:40] hashar: go for it!
[12:23:02] Hauskatze: Urbanecm: ideally changing a throttle should not require a deployment :] [12:23:24] that would ideally be made available to end users on meta / some global special page [12:23:34] hashar: the mediawiki extension suposed to do that is waiting on deployment for 7 years as Urbanecm said [12:23:39] ah [12:23:51] hashar, that's what T27000 is about, we have this page for 7 years, but nobody did security review etc [12:23:51] T27000: Deploy ThrottleOverride extension to Wikimedia wikis - https://phabricator.wikimedia.org/T27000 [12:23:57] well guess we could look at getting it polished up / tested and maybe get it deployed? [12:24:00] Palace things go slowly :P [12:24:07] you can probably bring it on to wikitech-l [12:25:02] (CR) Hashar: [C: 2] Define new throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/413713 (https://phabricator.wikimedia.org/T188090) (owner: Urbanecm) [12:25:04] but it won't change too much I guess. It won't require deployers, but sysops which is (at least for my wiki) around 30 in total and only 5-10 are available on daily basis ;) [12:26:30] (Merged) jenkins-bot: Define new throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/413713 (https://phabricator.wikimedia.org/T188090) (owner: Urbanecm) [12:28:10] !log hashar@tin Synchronized wmf-config/throttle.php: Define new throttle rule - T188090 (duration: 01m 11s) [12:28:20] Urbanecm: deployed!!! :] [12:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:25] T188090: Request to throttle account creation limit for Hong Kong Tutorial - https://phabricator.wikimedia.org/T188090 [12:28:28] hashar, thank you! [12:28:46] Urbanecm: and again thank you for taking care of all those requests :-] [12:30:16] hashar: https://gerrit.wikimedia.org/r/413715 [12:30:21] for submit [12:30:23] meh [12:31:20] Hauskatze, what is it if I may ask?
[12:31:31] a new project [12:31:40] they asked for it [12:31:53] I'm configuring it [12:31:55] Hauskatze: done [12:31:59] ty [12:32:02] and then I guess we can look at adding CI jobs to it [12:37:01] when some content is added, perhaps [12:39:25] (CR) jenkins-bot: Define new throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/413713 (https://phabricator.wikimedia.org/T188090) (owner: Urbanecm) [12:41:13] (PS9) Gehel: wdqs: allow configuration of kafka based updates [puppet] - https://gerrit.wikimedia.org/r/412873 (https://phabricator.wikimedia.org/T185951) [12:43:29] (PS1) Marostegui: db-eqiad.php: Increase traffic db1083,db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/413717 [12:50:27] Operations, Mathoid, Prod-Kubernetes, Kubernetes, and 3 others: Serve at least 50% of Mathoid via kubernetes - https://phabricator.wikimedia.org/T184919#3996105 (mobrovac) The latencies as witnessed by RESTBase clients are available on [this panel](https://grafana.wikimedia.org/dashboard/db/restb... [12:57:54] (PS3) Jcrespo: tendril: Revert all db optimizations except the default table engine [puppet] - https://gerrit.wikimedia.org/r/413712 (https://phabricator.wikimedia.org/T184704) [13:07:07] hashar: can I deploy? [13:08:28] marostegui: yes ! [13:08:33] thanks!
[13:08:42] (CR) Marostegui: [C: 2] db-eqiad.php: Increase traffic db1083,db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/413717 (owner: Marostegui) [13:10:11] (Merged) jenkins-bot: db-eqiad.php: Increase traffic db1083,db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/413717 (owner: Marostegui) [13:10:22] (CR) jenkins-bot: db-eqiad.php: Increase traffic db1083,db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/413717 (owner: Marostegui) [13:11:45] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1083 and fully repool db1076 (duration: 01m 13s) [13:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:20] (PS1) Urbanecm: New throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/413722 (https://phabricator.wikimedia.org/T188091) [13:28:13] hashar, sorry to write again: another last minute request T188091, https://gerrit.wikimedia.org/r/413722... Can you do the rest? ;) [13:28:14] T188091: Raise throttling cap on user registration, image upload on commons.wikimedia.org, te.wikipedia.org and te.wikisource.org on 2018-02-24 to 2018-02-25 - https://phabricator.wikimedia.org/T188091 [13:28:33] (CR) Elukey: "The alternative and probably better approach is to have one profile that grabs a hash (or similar) from hiera containing all the parameter" [puppet] - https://gerrit.wikimedia.org/r/413687 (owner: Elukey) [13:32:42] Operations: Add Prometheus collector for Tor - https://phabricator.wikimedia.org/T188098#3996174 (MoritzMuehlenhoff) [13:41:05] (PS1) Lucas Werkmeister (WMDE): Enable caching of constraint check results [mediawiki-config] - https://gerrit.wikimedia.org/r/413724 (https://phabricator.wikimedia.org/T184812) [13:41:33] (CR) Lucas Werkmeister (WMDE): [C: -1] "Do not merge before wmf.22 is deployed on wikidatawiki!"
[mediawiki-config] - https://gerrit.wikimedia.org/r/413724 (https://phabricator.wikimedia.org/T184812) (owner: Lucas Werkmeister (WMDE)) [13:42:03] (PS1) Arturo Borrero Gonzalez: toollabs: apt_pinning: be more strict in linux kernel pinning [puppet] - https://gerrit.wikimedia.org/r/413725 (https://phabricator.wikimedia.org/T187193) [13:43:39] (CR) Jcrespo: [C: 2] tendril: Revert all db optimizations except the default table engine [puppet] - https://gerrit.wikimedia.org/r/413712 (https://phabricator.wikimedia.org/T184704) (owner: Jcrespo) [13:43:55] (CR) Arturo Borrero Gonzalez: [C: 2] toollabs: apt_pinning: be more strict in linux kernel pinning [puppet] - https://gerrit.wikimedia.org/r/413725 (https://phabricator.wikimedia.org/T187193) (owner: Arturo Borrero Gonzalez) [13:44:03] (PS2) Arturo Borrero Gonzalez: toollabs: apt_pinning: be more strict in linux kernel pinning [puppet] - https://gerrit.wikimedia.org/r/413725 (https://phabricator.wikimedia.org/T187193) [13:44:25] !log reboot ocg1003 for tests [13:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:08] jynus: please go ahead with your merge [13:46:19] doing [13:46:33] it takes some time to find the tab among the 20 [13:46:53] do I merge yours, too, I guess yes? [13:47:06] well, yes [13:47:23] I saw also 2 patches, that's why I ping'ed you [13:47:36] not sure how we proceed in these cases [13:47:44] ok, I got confused because you said to merge mine [13:48:00] normally, just ping the other party and merge the 2 if said yes [13:48:06] k [13:48:21] if the party is unresponsive and you do not feel confortable deploying it [13:48:30] revert on gerrint and deploy the 3 [13:48:38] k [13:48:41] thanks!
[13:56:08] Operations, hardware-requests: Site: (2) hardware access request for videoscalers - https://phabricator.wikimedia.org/T188075#3996214 (MoritzMuehlenhoff) The image scalers currently only serve requests for internal wikis and Gilles is in the process of also moving those to Thumbor. Once completed, we cou... [13:57:30] Hauskatze: hello. Generally, people take care to request *in advance* when they're used to plan events. But, first timer organizers struggle to find the process. Another problem is sometimes the definitive IP address ranges are complicated to get. [13:58:20] dereckson: there are people who always wait til the last minute [13:58:58] but if you (the wmf-deployment people) is okay with that [13:59:09] then I have no objections [14:00:06] Operations, hardware-requests: Site: (2) hardware access request for videoscalers - https://phabricator.wikimedia.org/T188075#3996229 (brion) @MoritzMuehlenhoff Great! What are the specs on those, for reference? [14:00:20] I think a more stable long-term solution would be to handle these requests on wiki (see https://phabricator.wikimedia.org/T27000, a 5 digits task) [14:00:32] but pending that, there is not a lot we can do. [14:00:36] throttleoverride?
[14:00:42] yes [14:00:52] heh, good luck, 7 years in the queue [14:01:12] if there's anything I can do to get that thing moving let me know [14:01:15] (and to decline a request would mean we would also give a bad experience to first comers, ruining outreach effort) [14:01:52] test the extension in a multiwiki context, and see the UI is fine to add rules would be nice [14:02:25] if they deploy to beta cluster I can test, otherwise I'm afraid I would not be able to do that [14:02:31] I cannot install mw-vagrant [14:07:48] Operations, hardware-requests: Site: (2) hardware access request for videoscalers - https://phabricator.wikimedia.org/T188075#3996236 (MoritzMuehlenhoff) Our current six eqiad image scalers are Dell PowerEdge R430 with 40 cores (Xeon CPU E5-2650 v3 @ 2.30GHz) and 64G RAM. [14:08:37] marostegui: a good time to reboot tendril? [14:08:51] he may be out, I think [14:08:59] so that probably means yes [14:09:47] !log restarting tendril database- will case unavailability of dbtree for a while [14:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:08] Operations, hardware-requests: Site: (2) hardware access request for videoscalers - https://phabricator.wikimedia.org/T188075#3996242 (brion) Is that dual-socket for 20 cores/40 threads or quad-socket for 40 cores/80 threads? Hyperthreading makes everything confusing. ;) Either way those should work very... [14:16:03] (PS1) Elukey: role::webserver_misc_apps: refactor kafka Burrow configuration [puppet] - https://gerrit.wikimedia.org/r/413728 [14:18:18] Operations, hardware-requests: Site: (2) hardware access request for videoscalers - https://phabricator.wikimedia.org/T188075#3996249 (brion) (And would all 6 be available for video scaler use -- I'll happily take them! -- or would we share with a bigger pool?)
[14:19:02] (PS2) Elukey: role::webserver_misc_apps: refactor kafka Burrow configuration [puppet] - https://gerrit.wikimedia.org/r/413728 [14:22:05] (PS3) Elukey: role::webserver_misc_apps: refactor kafka Burrow configuration [puppet] - https://gerrit.wikimedia.org/r/413728 [14:24:04] Operations, hardware-requests: Site: (2) hardware access request for videoscalers - https://phabricator.wikimedia.org/T188075#3996273 (brion) [14:25:03] (PS4) Elukey: role::webserver_misc_apps: refactor kafka Burrow configuration [puppet] - https://gerrit.wikimedia.org/r/413728 [14:27:14] (CR) Ottomata: "Ha, this is on the spark repository sooooo...I'd assume it was spark! Great! Let's deploy this next week." [debs/spark2] (debian) - https://gerrit.wikimedia.org/r/405894 (https://phabricator.wikimedia.org/T185581) (owner: Ottomata) [14:28:21] (Abandoned) Elukey: [WIP] Introduce new kafka::monitoring::eqiad|codfw roles [puppet] - https://gerrit.wikimedia.org/r/413687 (owner: Elukey) [14:28:26] \o [14:29:02] (PS5) Elukey: role::webserver_misc_apps: refactor kafka Burrow configuration [puppet] - https://gerrit.wikimedia.org/r/413728 [14:32:15] (PS6) Elukey: role::webserver_misc_apps: refactor kafka Burrow configuration [puppet] - https://gerrit.wikimedia.org/r/413728 [14:36:25] Operations, hardware-requests: Site: (2) hardware access request for videoscalers - https://phabricator.wikimedia.org/T188075#3996299 (MoritzMuehlenhoff) >>! In T188075#3996242, @brion wrote: > Is that dual-socket for 20 cores/40 threads or quad-socket for 40 cores/80 threads? Hyperthreading makes everyt...
[14:40:23] !log installing kernel updates on API servers in codfw [14:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:08] spiffy :D [14:41:48] TIL 'lscpu' is way more succinct than dumping /proc/cpuinfo and manually looking through it [14:43:42] RECOVERY - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.042 second response time [14:44:37] (PS7) Elukey: role::webserver_misc_apps: refactor kafka Burrow configuration [puppet] - https://gerrit.wikimedia.org/r/413728 [14:47:11] Operations, hardware-requests: Site: (2) hardware access request for videoscalers - https://phabricator.wikimedia.org/T188075#3996307 (brion) [14:48:27] dammit i didn't remove my fixme comment [14:52:08] Operations, hardware-requests: Site: (2) hardware access request for videoscalers - https://phabricator.wikimedia.org/T188075#3996319 (brion) [14:53:01] Operations, hardware-requests: Site: (2) hardware access request for videoscalers - https://phabricator.wikimedia.org/T188075#3995477 (brion) [14:55:02] (PS8) Elukey: Introduce role::kafka::monitoring [puppet] - https://gerrit.wikimedia.org/r/413728 [14:55:21] Operations, ops-codfw: rack/setup/install wdqs200[4-6] - https://phabricator.wikimedia.org/T187800#3996327 (Gehel) [14:57:49] Operations, Continuous-Integration-Infrastructure, MediaWiki-Core-Tests, HHVM: HHVM 3.18.5+dfsg-1+wmf3 changes parse_url causing unit tests to fail - https://phabricator.wikimedia.org/T185024#3996328 (MoritzMuehlenhoff) @greg Travis uses the debs packaged by Facebook and new releases have been ma...
[14:59:31] !log update facts on puppet compiler [14:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:41] PROBLEM - HHVM jobrunner on mw1304 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [15:05:41] RECOVERY - HHVM jobrunner on mw1304 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [15:09:14] (CR) Jcrespo: [C: 2] mariadb: Fix and standarize firewall holes to all cloud-related mariadbs [puppet] - https://gerrit.wikimedia.org/r/413375 (https://phabricator.wikimedia.org/T184704) (owner: Jcrespo) [15:09:22] (PS6) Jcrespo: mariadb: Fix and standarize firewall holes to all cloud-related mariadbs [puppet] - https://gerrit.wikimedia.org/r/413375 (https://phabricator.wikimedia.org/T184704) [15:12:35] (CR) Elukey: "pcc: https://puppet-compiler.wmflabs.org/compiler02/10131/" [puppet] - https://gerrit.wikimedia.org/r/413728 (owner: Elukey) [15:13:46] (PS9) Elukey: Introduce role::kafka::monitoring [puppet] - https://gerrit.wikimedia.org/r/413728 (https://phabricator.wikimedia.org/T187805) [15:15:14] !log about to deploy gerrit:413375 disabling puppet on affected hosts [15:15:25] Operations, Patch-For-Review, User-Elukey: Ganeti instances to support Kafka Burrow Consumer lag monitoring - https://phabricator.wikimedia.org/T187805#3996384 (elukey) The last code review makes everything configurable via hiera, using the same role for both kafkamon hosts.
[15:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:21] PROBLEM - Nginx local proxy to apache on mw2204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:20:36] (PS1) Marostegui: db-eqiad.php: Increase traffic for db1083 [mediawiki-config] - https://gerrit.wikimedia.org/r/413738
[15:21:12] RECOVERY - Nginx local proxy to apache on mw2204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.202 second response time
[15:23:05] (CR) Marostegui: [C: 2] db-eqiad.php: Increase traffic for db1083 [mediawiki-config] - https://gerrit.wikimedia.org/r/413738 (owner: Marostegui)
[15:24:36] (Merged) jenkins-bot: db-eqiad.php: Increase traffic for db1083 [mediawiki-config] - https://gerrit.wikimedia.org/r/413738 (owner: Marostegui)
[15:24:39] (CR) Eevans: [C: 1] "> Done thanks! I suppose that after https://gerrit.wikimedia.org/r/#/c/402069/" [puppet] - https://gerrit.wikimedia.org/r/413405 (https://phabricator.wikimedia.org/T184795) (owner: Elukey)
[15:26:53] (CR) jenkins-bot: db-eqiad.php: Increase traffic for db1083 [mediawiki-config] - https://gerrit.wikimedia.org/r/413738 (owner: Marostegui)
[15:27:07] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1083 (duration: 02m 21s)
[15:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:25] (PS1) BBlack: 1.5: bugfix for vcl_fini + nonexistent db [software/varnish/libvmod-netmapper] - https://gerrit.wikimedia.org/r/413740
[15:42:46] (PS2) BBlack: 1.5: bugfix for vcl_fini + nonexistent db [software/varnish/libvmod-netmapper] - https://gerrit.wikimedia.org/r/413740 (https://phabricator.wikimedia.org/T188089)
[15:44:21] Operations, Traffic, Patch-For-Review: VCL discards crash varnish frontend child process - https://phabricator.wikimedia.org/T188089#3996469 (BBlack)
[15:44:24] Operations, Traffic: varnish: discard cold vcl - https://phabricator.wikimedia.org/T187778#3996468 (BBlack)
[15:46:41] PROBLEM - HHVM jobrunner on mw1301 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[15:47:41] RECOVERY - HHVM jobrunner on mw1301 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time
[15:51:11] (CR) Ema: [C: 1] 1.5: bugfix for vcl_fini + nonexistent db [software/varnish/libvmod-netmapper] - https://gerrit.wikimedia.org/r/413740 (https://phabricator.wikimedia.org/T188089) (owner: BBlack)
[15:53:09] Operations, ops-codfw, DC-Ops, hardware-requests: Decommission old and unused/spare servers in codfw - https://phabricator.wikimedia.org/T187474#3996488 (Papaul) a:Papaul>RobH @RobH Can you please do the switch part and assign back to me ? Thanks.
[15:56:09] Operations, ops-codfw: rack/setup/install wdqs200[4-6] - https://phabricator.wikimedia.org/T187800#3996492 (Papaul)
[15:57:01] PROBLEM - puppet last run on labsdb1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:57:04] Operations, ops-codfw: rack/setup/install wdqs200[4-6] - https://phabricator.wikimedia.org/T187800#3986247 (Papaul)
[15:57:41] (CR) Bstorm: [C: 2] tools-static: Remove problematic headers from proxy responses [puppet] - https://gerrit.wikimedia.org/r/413469 (https://phabricator.wikimedia.org/T182604) (owner: Bstorm)
[15:57:52] (PS4) Bstorm: tools-static: Remove problematic headers from proxy responses [puppet] - https://gerrit.wikimedia.org/r/413469 (https://phabricator.wikimedia.org/T182604)
[15:58:09] !log rebooting job runners in codfw for kernel security updates
[15:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:53] (PS1) Jcrespo: tendril: Fix typo for role on firewall definition [puppet] - https://gerrit.wikimedia.org/r/413741
[15:58:55] labsdb1010 is me
[15:58:57] ^fixing
[15:59:16] not sure why the compiler didn't catch the typo, sorry :-(
[16:00:10] (PS2) Jcrespo: tendril: Fix typo for role on firewall definition [puppet] - https://gerrit.wikimedia.org/r/413741
[16:00:12] (CR) Andrew Bogott: [C: 1] tendril: Fix typo for role on firewall definition [puppet] - https://gerrit.wikimedia.org/r/413741 (owner: Jcrespo)
[16:00:47] (CR) Jcrespo: [C: 2] tendril: Fix typo for role on firewall definition [puppet] - https://gerrit.wikimedia.org/r/413741 (owner: Jcrespo)
[16:02:07] (PS1) Marostegui: db-eqiad.php: Fully repool db1083 [mediawiki-config] - https://gerrit.wikimedia.org/r/413742
[16:02:55] (PS3) BBlack: 1.5: bugfix for vcl_fini + set thread name [software/varnish/libvmod-netmapper] - https://gerrit.wikimedia.org/r/413740 (https://phabricator.wikimedia.org/T188089)
[16:04:50] (CR) Marostegui: [C: 2] db-eqiad.php: Fully repool db1083 [mediawiki-config] - https://gerrit.wikimedia.org/r/413742 (owner: Marostegui)
[16:06:21] (PS4) BBlack: 1.6: bugfix for vcl_fini + set thread name [software/varnish/libvmod-netmapper] - https://gerrit.wikimedia.org/r/413740 (https://phabricator.wikimedia.org/T188089)
[16:06:25] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1083 [mediawiki-config] - https://gerrit.wikimedia.org/r/413742 (owner: Marostegui)
[16:06:34] Operations, Ops-Access-Requests, Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for katielin (katie) - https://phabricator.wikimedia.org/T187623#3996523 (RobH) @RStallman-legalteam: Can you confirm receipt of the NDA? The google sheet s...
[16:06:46] Operations, Ops-Access-Requests, Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for katielin (katie) - https://phabricator.wikimedia.org/T187623#3996525 (RobH) a:katielin>RStallman-legalteam
[16:07:05] RECOVERY - puppet last run on labsdb1010 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[16:07:11] (CR) jenkins-bot: db-eqiad.php: Fully repool db1083 [mediawiki-config] - https://gerrit.wikimedia.org/r/413742 (owner: Marostegui)
[16:08:09] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool db1083 (duration: 01m 14s)
[16:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:13] (PS1) Andrew Bogott: puppet-merge: add some conftool extras [puppet] - https://gerrit.wikimedia.org/r/413745 (https://phabricator.wikimedia.org/T157133)
[16:29:29] (PS4) Andrew Bogott: m5: update db grants for new labweb services [puppet] - https://gerrit.wikimedia.org/r/412964 (https://phabricator.wikimedia.org/T168470)
[16:29:31] (PS3) Andrew Bogott: m5: add ferm rules for new labweb hosts [puppet] - https://gerrit.wikimedia.org/r/412970 (https://phabricator.wikimedia.org/T168470)
[16:29:33] (PS1) Andrew Bogott: m5: remove grants for Californium [puppet] - https://gerrit.wikimedia.org/r/413748 (https://phabricator.wikimedia.org/T168470)
[16:30:43] (CR) Andrew Bogott: [C: 2] m5: update db grants for new labweb services [puppet] - https://gerrit.wikimedia.org/r/412964 (https://phabricator.wikimedia.org/T168470) (owner: Andrew Bogott)
[16:34:01] (CR) Andrew Bogott: [C: 2] m5: add ferm rules for new labweb hosts [puppet] - https://gerrit.wikimedia.org/r/412970 (https://phabricator.wikimedia.org/T168470) (owner: Andrew Bogott)
[16:36:24] Operations, Traffic, ZeroPortal: Cannot fetch Zero carriers/proxies JSON files from eqsin - https://phabricator.wikimedia.org/T188111#3996633 (BBlack) p:Triage>High
[16:36:49] Operations, Traffic: Enable Service in Asia Cache DC - https://phabricator.wikimedia.org/T156026#3996648 (BBlack)
[16:37:03] !log rebooting image scalers in codfw for kernel security updates
[16:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:40:33] (CR) Ema: [C: 2] 1.6: bugfix for vcl_fini + set thread name [software/varnish/libvmod-netmapper] - https://gerrit.wikimedia.org/r/413740 (https://phabricator.wikimedia.org/T188089) (owner: BBlack)
[16:44:32] (PS1) Andrew Bogott: ferm_wmcs: fix copy/paste error [puppet] - https://gerrit.wikimedia.org/r/413749
[16:45:36] (CR) Andrew Bogott: [C: 2] ferm_wmcs: fix copy/paste error [puppet] - https://gerrit.wikimedia.org/r/413749 (owner: Andrew Bogott)
[16:48:33] Operations, Traffic, ZeroPortal: Cannot fetch Zero carriers/proxies JSON files from eqsin - https://phabricator.wikimedia.org/T188111#3996678 (BBlack) Assuming it is a whitelist of the private networks containing prod caches, the new additions to the list for ipv6+ipv4 would be: ``` 2001:df2:e500:10...
[17:02:52] (PS1) Ema: 1.6-1: bugfix for vcl_fini + set thread name [software/varnish/libvmod-netmapper] (debian) - https://gerrit.wikimedia.org/r/413754 (https://phabricator.wikimedia.org/T188089)
[17:06:58] (CR) Paladox: [C: 1] Gerrit: Set plugin.webhooks.sslVerify = true [puppet] - https://gerrit.wikimedia.org/r/413661 (owner: Chad)
[17:07:35] (CR) Paladox: [C: 1] "This is for the pending webhooks plugin we are planning to install so wont affect anything yet. (noop)" [puppet] - https://gerrit.wikimedia.org/r/413661 (owner: Chad)
[17:08:09] I'm about to generate a failure for nova-fullstack fyi, it will show up here
[17:09:06] (CR) Ema: [C: 2] 1.6-1: bugfix for vcl_fini + set thread name [software/varnish/libvmod-netmapper] (debian) - https://gerrit.wikimedia.org/r/413754 (https://phabricator.wikimedia.org/T188089) (owner: Ema)
[17:09:21] (PS1) Andrew Bogott: striker_admin: add db grant to wasat (backup for terbium) [puppet] - https://gerrit.wikimedia.org/r/413755
[17:10:19] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[17:10:26] ^ me
[17:10:37] (CR) Andrew Bogott: [C: 2] striker_admin: add db grant to wasat (backup for terbium) [puppet] - https://gerrit.wikimedia.org/r/413755 (owner: Andrew Bogott)
[17:10:47] PROBLEM - Varnish HTTP text-backend - port 3128 on cp4028 is CRITICAL: connect to address 10.128.0.128 and port 3128: Connection refused
[17:11:06] oh, that seem serious
[17:11:19] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[17:11:39] ema^?
[17:11:47] RECOVERY - Varnish HTTP text-backend - port 3128 on cp4028 is OK: HTTP OK: HTTP/1.1 200 OK - 218 bytes in 0.157 second response time
[17:12:01] maybe just a downtime?
[17:14:07] Operations, Traffic, ZeroPortal: Cannot fetch Zero carriers/proxies JSON files from eqsin - https://phabricator.wikimedia.org/T188111#3996786 (Mholloway) I'll look more later (have to run off to an appt soon), but one thing I notice right off the bat is that zerofetch.py is using the deprecated `acti...
[17:15:23] (PS1) Vgutierrez: Provide test cases for BGP parsing. [debs/pybal] - https://gerrit.wikimedia.org/r/413756 (https://phabricator.wikimedia.org/T188085)
[17:18:19] jynus: known issue, some cron-scheduled varnish backend restarts spam here on irc because the restart takes just a bit longer than usual
[17:20:25] thanks
[17:22:22] !log libvmod-netmapper 1.6-1 uploaded to apt.w.o/experimental T188089
[17:22:24] (PS9) Elukey: [WIP] eventlogging: add systemd support [puppet] - https://gerrit.wikimedia.org/r/413362
[17:22:27] Operations, ops-eqiad, fundraising-tech-ops, Patch-For-Review: Rack/setup frmon1001 - https://phabricator.wikimedia.org/T186073#3996837 (Cmjohnson) frmon1001 network port 22 both frasw-c1a and frasw-c1b
[17:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:34] T188089: VCL discards crash varnish frontend child process - https://phabricator.wikimedia.org/T188089
[17:22:51] Operations, ops-eqiad, fundraising-tech-ops: rack frbast1001 - https://phabricator.wikimedia.org/T187363#3996840 (Cmjohnson) frbast1001 port 23 both frasw-c1a and frasw-c1b
[17:23:29] Operations, ops-eqiad, fundraising-tech-ops: rack frdata1001 - https://phabricator.wikimedia.org/T187364#3996841 (Cmjohnson) frdata1001 network port 24 both frasw-c1a and frasw-c1b
[17:23:55] Operations, ops-eqiad, fundraising-tech-ops: rack frpig1001 - https://phabricator.wikimedia.org/T187365#3996842 (Cmjohnson) frpig1001 nework port 25 both frasw-c1a and frasw-c1b
[17:24:39] (PS10) Elukey: [WIP] eventlogging: add systemd support [puppet] - https://gerrit.wikimedia.org/r/413362
[17:28:01] (CR) Ema: [C: 1] "LGTM, minor comment inline." (1 comment) [debs/pybal] - https://gerrit.wikimedia.org/r/413756 (https://phabricator.wikimedia.org/T188085) (owner: Vgutierrez)
[17:31:10] (PS1) Muehlenhoff: Fix verbose logging in debdeploy-deploy [debs/debdeploy] - https://gerrit.wikimedia.org/r/413758
[17:39:31] Operations, Traffic, ZeroPortal: Cannot fetch Zero carriers/proxies JSON files from eqsin - https://phabricator.wikimedia.org/T188111#3996911 (BBlack) >>! In T188111#3996786, @Mholloway wrote: > I'll look more later (have to run off to an appt soon), but one thing I notice right off the bat is that z...
[17:39:37] PROBLEM - Check systemd state on rhenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[17:42:27] (PS2) Vgutierrez: Provide test cases for BGP parsing. [debs/pybal] - https://gerrit.wikimedia.org/r/413756 (https://phabricator.wikimedia.org/T188085)
[17:44:30] (CR) Vgutierrez: Provide test cases for BGP parsing. (1 comment) [debs/pybal] - https://gerrit.wikimedia.org/r/413756 (https://phabricator.wikimedia.org/T188085) (owner: Vgutierrez)
[17:44:51] (CR) Vgutierrez: [C: 2] Provide test cases for BGP parsing. [debs/pybal] - https://gerrit.wikimedia.org/r/413756 (https://phabricator.wikimedia.org/T188085) (owner: Vgutierrez)
[17:45:17] (CR) Vgutierrez: [V: 2 C: 2] Provide test cases for BGP parsing. [debs/pybal] - https://gerrit.wikimedia.org/r/413756 (https://phabricator.wikimedia.org/T188085) (owner: Vgutierrez)
[17:45:23] Operations, DBA, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#3996945 (jcrespo)
[17:45:28] Operations, DBA, Goal, Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3996946 (jcrespo)
[17:49:22] Operations, ops-eqiad, DBA, hardware-requests: Decommission db1043 - https://phabricator.wikimedia.org/T187542#3996962 (jcrespo) a:jcrespo>RobH
[17:50:02] Operations, ops-eqiad, DBA, hardware-requests: Decommission db1043 - https://phabricator.wikimedia.org/T187542#3978426 (jcrespo)
[17:52:30] Operations, ops-codfw, DBA, hardware-requests: Decommission db2012 - https://phabricator.wikimedia.org/T187543#3996978 (jcrespo) a:RobH
[17:52:40] yay decoms!
[17:56:16] we have more coming
[17:56:39] I am giving them a week after stopping using them
[17:57:09] will not bother you before the week happens for the next ones
[17:58:47] this is an example of the ones coming https://phabricator.wikimedia.org/search/query/wW_kGEj2XFaP/#R for each one, we quadruple the available racks space (2 2U -> 11U)
[17:59:23] Operations, ops-eqiad, Discovery, Wikidata, and 2 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#3996993 (Gehel)
[18:11:58] (PS1) Rush: openstack: nova-fullstack alert after 1 retry [puppet] - https://gerrit.wikimedia.org/r/413766 (https://phabricator.wikimedia.org/T178405)
[18:16:37] (CR) Rush: [C: 2] openstack: nova-fullstack alert after 1 retry [puppet] - https://gerrit.wikimedia.org/r/413766 (https://phabricator.wikimedia.org/T178405) (owner: Rush)
[18:17:07] gilles: https://phabricator.wikimedia.org/T188122 thumbor?
[18:22:38] (CR) Muehlenhoff: "Should be good to go once labweb* is up and running" [puppet] - https://gerrit.wikimedia.org/r/380712 (owner: Muehlenhoff)
[18:27:50] (PS1) Rush: openstack: nova-api set to critical based on deployment [puppet] - https://gerrit.wikimedia.org/r/413770 (https://phabricator.wikimedia.org/T178405)
[18:28:26] (CR) jerkins-bot: [V: -1] openstack: nova-api set to critical based on deployment [puppet] - https://gerrit.wikimedia.org/r/413770 (https://phabricator.wikimedia.org/T178405) (owner: Rush)
[18:28:32] !log mwscript extensions/Wikibase/lib/maintenance/populateSitesTable.php --wiki=enwiki --force-protocol https (T183019)
[18:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:48] T183019: Wikibase must not insert local recentchanges entries for nonexistent local users (days: 5) - https://phabricator.wikimedia.org/T183019
[18:29:25] (PS2) Rush: openstack: nova-api set to critical based on deployment [puppet] - https://gerrit.wikimedia.org/r/413770 (https://phabricator.wikimedia.org/T178405)
[18:31:03] (CR) Rush: [C: 2] openstack: nova-api set to critical based on deployment [puppet] - https://gerrit.wikimedia.org/r/413770 (https://phabricator.wikimedia.org/T178405) (owner: Rush)
[18:35:46] (PS1) Rush: openstack: nova-conductor critical by deployment [puppet] - https://gerrit.wikimedia.org/r/413772 (https://phabricator.wikimedia.org/T178405)
[18:38:50] (PS2) Rush: openstack: nova-conductor critical by deployment [puppet] - https://gerrit.wikimedia.org/r/413772 (https://phabricator.wikimedia.org/T178405)
[18:43:14] Does anyone know why interwiki table is completely empty, everywhere?
[18:43:21] how interwiki now works
[18:44:05] Because we don't use it
[18:44:14] Amir1_: https://noc.wikimedia.org/conf/highlight.php?file=interwiki.php
[18:45:37] Reedy: oh thanks
[18:49:53] (PS11) Elukey: [WIP] eventlogging: add systemd support [puppet] - https://gerrit.wikimedia.org/r/413362
[18:50:19] (CR) Rush: [C: 2] openstack: nova-conductor critical by deployment [puppet] - https://gerrit.wikimedia.org/r/413772 (https://phabricator.wikimedia.org/T178405) (owner: Rush)
[18:52:04] (PS12) Elukey: [WIP] eventlogging: add systemd support [puppet] - https://gerrit.wikimedia.org/r/413362
[19:08:21] (PS1) Arturo Borrero Gonzalez: toollabs: apt_pinning: extend linux pinning in jessie [puppet] - https://gerrit.wikimedia.org/r/413776 (https://phabricator.wikimedia.org/T187193)
[19:10:00] (PS2) Arturo Borrero Gonzalez: toollabs: apt_pinning: extend linux pinning in jessie [puppet] - https://gerrit.wikimedia.org/r/413776 (https://phabricator.wikimedia.org/T187193)
[19:10:38] (CR) Arturo Borrero Gonzalez: [C: 2] toollabs: apt_pinning: extend linux pinning in jessie [puppet] - https://gerrit.wikimedia.org/r/413776 (https://phabricator.wikimedia.org/T187193) (owner: Arturo Borrero Gonzalez)
[19:10:59] !log restart relforge elasticsearch cluster to test entity extraction on larger dataest
[19:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:13:36] PROBLEM - ElasticSearch health check for shards on relforge1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.4.13:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.4.13, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f9b49283b90: Failed to establish a new connection: [Errno 111] Connection ref
[19:13:56] PROBLEM - ElasticSearch health check for shards on relforge1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 81 threshold =0.1 breach: status: red, number_of_nodes: 1, unassigned_shards: 81, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 85, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 51.204
[19:13:56] rds: 85, initializing_shards: 0, number_of_data_nodes: 1, delayed_unassigned_shards: 0
[19:14:16] PROBLEM - Check systemd state on relforge1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[19:14:38] Operations, Security-Team, Security: Password Vault for Security Team - https://phabricator.wikimedia.org/T185236#3910431 (JBennett) Couple questions: *Is each 'group' required to have their own repo? How is that access and credential sharing determined? *If that's not the case, how will we prevent...
[19:15:16] RECOVERY - Check systemd state on relforge1001 is OK: OK - running: The system is fully operational
[19:16:23] (PS1) Rush: openstack: monitor nova-scheduler as critical [puppet] - https://gerrit.wikimedia.org/r/413778 (https://phabricator.wikimedia.org/T178405)
[19:16:36] RECOVERY - ElasticSearch health check for shards on relforge1001 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 156, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 169, initial
[19:16:36] mber_of_data_nodes: 2, delayed_unassigned_shards: 0
[19:16:53] (CR) jerkins-bot: [V: -1] openstack: monitor nova-scheduler as critical [puppet] - https://gerrit.wikimedia.org/r/413778 (https://phabricator.wikimedia.org/T178405) (owner: Rush)
[19:16:56] RECOVERY - ElasticSearch health check for shards on relforge1002 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 156, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 169, initial
[19:16:56] mber_of_data_nodes: 2, delayed_unassigned_shards: 0
[19:17:50] (PS2) Rush: openstack: monitor nova-scheduler as critical [puppet] - https://gerrit.wikimedia.org/r/413778 (https://phabricator.wikimedia.org/T178405)
[19:19:58] (PS3) Rush: openstack: monitor nova-scheduler as critical [puppet] - https://gerrit.wikimedia.org/r/413778 (https://phabricator.wikimedia.org/T178405)
[19:22:10] (PS1) Arturo Borrero Gonzalez: toollabs: apt_pinning: extend pinnigs for pam libs [puppet] - https://gerrit.wikimedia.org/r/413780 (https://phabricator.wikimedia.org/T187193)
[19:28:00] Operations, Ops-Access-Requests, Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for katielin (katie) - https://phabricator.wikimedia.org/T187623#3981063 (Jgreen) I did phone verification for the SSH key Katie provided on this task, with...
[19:33:34] (PS1) Chico Venancio: shinken: WMCS: add load alerts for tools-bastion-0[23] [puppet] - https://gerrit.wikimedia.org/r/413781
[19:33:56] (CR) jerkins-bot: [V: -1] shinken: WMCS: add load alerts for tools-bastion-0[23] [puppet] - https://gerrit.wikimedia.org/r/413781 (owner: Chico Venancio)
[19:34:34] Operations, Patch-For-Review, cloud-services-team (Kanban): Onboard bstorm to WMF - https://phabricator.wikimedia.org/T185493#3997354 (chasemp)
[19:38:29] (PS4) Rush: openstack: monitor nova-scheduler as critical [puppet] - https://gerrit.wikimedia.org/r/413778 (https://phabricator.wikimedia.org/T178405)
[19:39:18] (CR) Rush: [C: 2] openstack: monitor nova-scheduler as critical [puppet] - https://gerrit.wikimedia.org/r/413778 (https://phabricator.wikimedia.org/T178405) (owner: Rush)
[19:43:26] Operations, Ops-Access-Requests, Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for katielin (katie) - https://phabricator.wikimedia.org/T187623#3997369 (RStallman-legalteam) Sorry for not updating the spreadsheet sooner! Yes, the Acknow...
[19:43:52] (PS1) Chad: Shuffle group0 wikis a bit [mediawiki-config] - https://gerrit.wikimedia.org/r/413784
[19:45:03] (PS2) Chad: Shuffle group0 wikis a bit [mediawiki-config] - https://gerrit.wikimedia.org/r/413784
[19:45:54] (PS3) Chad: Shuffle group0 wikis a bit [mediawiki-config] - https://gerrit.wikimedia.org/r/413784
[19:47:35] (CR) jerkins-bot: [V: -1] Shuffle group0 wikis a bit [mediawiki-config] - https://gerrit.wikimedia.org/r/413784 (owner: Chad)
[19:48:21] (CR) Rush: [C: 1] "looks right" [puppet] - https://gerrit.wikimedia.org/r/413631 (owner: Madhuvishy)
[19:48:53] (Draft1) Paladox: Gerrit: Move reviewers.config to modules/gerrit/files/etc/reviewers.config [puppet] - https://gerrit.wikimedia.org/r/413785
[19:48:56] (PS2) Paladox: Gerrit: Move reviewers.config to modules/gerrit/files/etc/reviewers.config [puppet] - https://gerrit.wikimedia.org/r/413785
[19:50:12] (PS4) Madhuvishy: nfs traffic_shaping: Add labstore1006|7 to tc setup [puppet] - https://gerrit.wikimedia.org/r/413631
[19:53:26] (PS2) Chico Venancio: shinken: WMCS: add load alerts for tools-bastion-0[23] [puppet] - https://gerrit.wikimedia.org/r/413781 (https://phabricator.wikimedia.org/T186552)
[20:05:59] PROBLEM - puppet last run on rdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:07:22] (PS2) 20after4: scap sync-canary plugin [mediawiki-config] - https://gerrit.wikimedia.org/r/413640
[20:07:28] (PS4) Chad: Shuffle group0 wikis a bit [mediawiki-config] - https://gerrit.wikimedia.org/r/413784
[20:08:13] (PS3) 20after4: scap sync-canary plugin [mediawiki-config] - https://gerrit.wikimedia.org/r/413640
[20:12:03] (PS1) Mholloway: Update zerofetch script to report login failure reason [puppet] - https://gerrit.wikimedia.org/r/413787 (https://phabricator.wikimedia.org/T188111)
[20:14:30] (PS1) Rush: icinga: wmcs-team set rush contact [puppet] - https://gerrit.wikimedia.org/r/413788 (https://phabricator.wikimedia.org/T178405)
[20:15:09] (PS2) Rush: icinga: wmcs-team set rush contact [puppet] - https://gerrit.wikimedia.org/r/413788 (https://phabricator.wikimedia.org/T178405)
[20:16:05] (CR) Rush: [C: 2] icinga: wmcs-team set rush contact [puppet] - https://gerrit.wikimedia.org/r/413788 (https://phabricator.wikimedia.org/T178405) (owner: Rush)
[20:19:10] Operations, Ops-Access-Requests, Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for katielin (katie) - https://phabricator.wikimedia.org/T187623#3997472 (RobH) a:RStallman-legalteam>MeganHernandez_WMF So the only thing we lack is...
[20:19:30] Operations, Thumbor, Patch-For-Review, Performance-Team (Radar), User-fgiunchedi: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817#3997475 (brion)
[20:20:07] (CR) Chad: [C: 2] group1 to wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/413660 (owner: Chad)
[20:20:15] Operations, Ops-Access-Requests, Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for katielin (katie) - https://phabricator.wikimedia.org/T187623#3997476 (RobH) stalled>Open
[20:20:19] (PS1) Papaul: DNS: Add mgmt and production entries for wdqs200[4-6] [dns] - https://gerrit.wikimedia.org/r/413791
[20:20:32] (CR) jerkins-bot: [V: -1] DNS: Add mgmt and production entries for wdqs200[4-6] [dns] - https://gerrit.wikimedia.org/r/413791 (owner: Papaul)
[20:21:46] (Merged) jenkins-bot: group1 to wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/413660 (owner: Chad)
[20:21:55] (CR) jenkins-bot: group1 to wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/413660 (owner: Chad)
[20:22:19] Operations, Traffic, ZeroPortal, Patch-For-Review: Cannot fetch Zero carriers/proxies JSON files from eqsin - https://phabricator.wikimedia.org/T188111#3997484 (Mholloway) >>! In T188111#3996911, @BBlack wrote: > I looked into this a little bit, and while I do see there's a deprecation warning is...
[20:24:06] about to fire a nova instance creation alert
[20:24:46] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[20:25:07] (PS1) Ottomata: Add duplicate mediawiki avro Camus job to consume from Kafka jumbo and analytics [puppet] - https://gerrit.wikimedia.org/r/413792 (https://phabricator.wikimedia.org/T188136)
[20:25:46] (CR) jerkins-bot: [V: -1] Add duplicate mediawiki avro Camus job to consume from Kafka jumbo and analytics [puppet] - https://gerrit.wikimedia.org/r/413792 (https://phabricator.wikimedia.org/T188136) (owner: Ottomata)
[20:26:16] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[20:28:30] (PS1) Rush: openstack: nova-fullstack test alert on 2 tries [puppet] - https://gerrit.wikimedia.org/r/413793 (https://phabricator.wikimedia.org/T178405)
[20:29:35] (CR) Rush: [C: 2] openstack: nova-fullstack test alert on 2 tries [puppet] - https://gerrit.wikimedia.org/r/413793 (https://phabricator.wikimedia.org/T178405) (owner: Rush)
[20:34:05] (PS1) Ottomata: Point Mediawiki Monolog at Kafka jumbo in deployment-prep [mediawiki-config] - https://gerrit.wikimedia.org/r/413795 (https://phabricator.wikimedia.org/T188136)
[20:34:16] (PS2) BBlack: Update zerofetch script to report login failure reason [puppet] - https://gerrit.wikimedia.org/r/413787 (https://phabricator.wikimedia.org/T188111) (owner: Mholloway)
[20:34:42] (PS1) Ottomata: Point Mediawiki Monolog at Kafka jumbo in production [mediawiki-config] - https://gerrit.wikimedia.org/r/413796 (https://phabricator.wikimedia.org/T188136)
[20:35:07] (CR) BBlack: [C: 2] Update zerofetch script to report login failure reason [puppet] - https://gerrit.wikimedia.org/r/413787 (https://phabricator.wikimedia.org/T188111) (owner: Mholloway)
[20:35:45] !log demon@tin rebuilt and synchronized wikiversions files: group1 to wmf.22
[20:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:36:05] RECOVERY - puppet last run on rdb1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:38:24] !log demon@tin rebuilt and synchronized wikiversions files: roll wikidatawiki back to wmf.11, busted
[20:38:35] PROBLEM - haproxy failover on dbproxy1005 is CRITICAL: CRITICAL check_failover servers up 2 down 1
[20:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:55] (PS1) Chad: wikidatawiki back to wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/413797
[20:38:57] (CR) Chad: [C: 2] wikidatawiki back to wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/413797 (owner: Chad)
[20:39:07] !log wmf.21, that is
[20:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:40:08] (Merged) jenkins-bot: wikidatawiki back to wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/413797 (owner: Chad)
[20:40:18] (CR) jenkins-bot: wikidatawiki back to wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/413797 (owner: Chad)
[20:41:51] Operations, Traffic, ZeroPortal, Patch-For-Review: Cannot fetch Zero carriers/proxies JSON files from eqsin - https://phabricator.wikimedia.org/T188111#3997503 (BBlack) Merged your patch (thanks). New failure in eqsin is: `Exception: API login phase2 gave result Failed with reason "Incorrect us...
[20:42:18] (03CR) 10EBernhardson: [C: 031] "seems reasonable" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413795 (https://phabricator.wikimedia.org/T188136) (owner: 10Ottomata) [20:44:49] (03CR) 10Ottomata: [C: 032] Point Mediawiki Monolog at Kafka jumbo in deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413795 (https://phabricator.wikimedia.org/T188136) (owner: 10Ottomata) [20:46:04] (03PS2) 10Ottomata: Add duplicate mediawiki avro Camus job to consume from Kafka jumbo and analytics [puppet] - 10https://gerrit.wikimedia.org/r/413792 (https://phabricator.wikimedia.org/T188136) [20:46:55] (03PS1) 10Chad: Move group2 to wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413798 [20:47:32] (03CR) 10jenkins-bot: Point Mediawiki Monolog at Kafka jumbo in deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413795 (https://phabricator.wikimedia.org/T188136) (owner: 10Ottomata) [20:48:16] !log demon@tin rebuilt and synchronized wikiversions files: group2 to wmf.22 [20:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:16] no_justification: mind syncing a labs no-op as well? 
or i can after you're done: https://gerrit.wikimedia.org/r/413795 [20:49:24] Please wait [20:54:41] (PS1) Rush: toolforge: set alerting for tools.checker things [puppet] - https://gerrit.wikimedia.org/r/413800 (https://phabricator.wikimedia.org/T178405) [20:56:59] (PS2) Rush: toolforge: set alerting for tools.checker things [puppet] - https://gerrit.wikimedia.org/r/413800 (https://phabricator.wikimedia.org/T178405) [20:57:45] PROBLEM - ElasticSearch health check for shards on relforge1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 160 threshold =0.1 breach: status: red, number_of_nodes: 2, unassigned_shards: 153, number_of_pending_tasks: 8, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 15, task_max_waiting_in_queue_millis: 12268, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: [20:57:45] ve_shards: 16, initializing_shards: 7, number_of_data_nodes: 2, delayed_unassigned_shards: 0 [20:58:14] (CR) Rush: [C: 2] toolforge: set alerting for tools.checker things [puppet] - https://gerrit.wikimedia.org/r/413800 (https://phabricator.wikimedia.org/T178405) (owner: Rush) [20:58:42] RECOVERY - ElasticSearch health check for shards on relforge1002 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 163, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 176, initial [20:58:42] mber_of_data_nodes: 2, delayed_unassigned_shards: 0 [20:59:40] Operations, Cloud-Services, Developer-Relations: Use the term "developer account" for Wikimedia LDAP accounts - https://phabricator.wikimedia.org/T179461#3997547 (bd808) [21:00:08] Operations, Cloud-Services, Developer-Relations: Use the term "developer account" for Wikimedia LDAP accounts -
https://phabricator.wikimedia.org/T179461#3725481 (bd808) [21:00:19] (PS1) Chad: Group2 to wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/413801 [21:00:21] (CR) Chad: [C: 2] Group2 to wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/413801 (owner: Chad) [21:01:52] PROBLEM - ElasticSearch health check for shards on relforge1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 85 threshold =0.1 breach: status: red, number_of_nodes: 2, unassigned_shards: 81, number_of_pending_tasks: 7, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 91, task_max_waiting_in_queue_millis: 7983, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 51. [21:01:52] shards: 91, initializing_shards: 4, number_of_data_nodes: 2, delayed_unassigned_shards: 78 [21:01:53] PROBLEM - ElasticSearch health check for shards on relforge1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 81 threshold =0.1 breach: status: red, number_of_nodes: 2, unassigned_shards: 77, number_of_pending_tasks: 7, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 94, task_max_waiting_in_queue_millis: 11645, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 53 [21:01:53] _shards: 95, initializing_shards: 4, number_of_data_nodes: 2, delayed_unassigned_shards: 74 [21:01:53] (Merged) jenkins-bot: Group2 to wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/413801 (owner: Chad) [21:02:08] (CR) jenkins-bot: Group2 to wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/413801 (owner: Chad) [21:02:52] RECOVERY - ElasticSearch health check for shards on relforge1002 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 163, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad,
relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 176, initial [21:02:52] mber_of_data_nodes: 2, delayed_unassigned_shards: 0 [21:02:53] RECOVERY - ElasticSearch health check for shards on relforge1001 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 163, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 176, initial [21:02:53] mber_of_data_nodes: 2, delayed_unassigned_shards: 0 [21:04:24] !log demon@tin Started scap: pos mysql code [21:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:41] (PS1) Rush: toolforge: tools checker contact_groups add admins [puppet] - https://gerrit.wikimedia.org/r/413804 (https://phabricator.wikimedia.org/T178405) [21:06:15] (CR) Rush: [C: 2] toolforge: tools checker contact_groups add admins [puppet] - https://gerrit.wikimedia.org/r/413804 (https://phabricator.wikimedia.org/T178405) (owner: Rush) [21:09:22] (CR) Volans: [C: 1] "LGTM, one nitpick inline" (1 comment) [debs/debdeploy] - https://gerrit.wikimedia.org/r/413758 (owner: Muehlenhoff) [21:10:46] Operations, Security-Team, Security: Password Vault for Security Team - https://phabricator.wikimedia.org/T185236#3997601 (demon) >>! In T185236#3997294, @JBennett wrote: > Couple questions: > *Is each 'group' required to have their own repo? How is that access and credential sharing determined? No...
[21:12:05] (CR) Chad: [C: 1] Gerrit: Move reviewers.config to modules/gerrit/files/etc/reviewers.config [puppet] - https://gerrit.wikimedia.org/r/413785 (owner: Paladox) [21:16:22] (PS3) Dzahn: Gerrit: Move reviewers.config to modules/gerrit/files/etc/reviewers.config [puppet] - https://gerrit.wikimedia.org/r/413785 (owner: Paladox) [21:17:21] (CR) Dzahn: [C: 2] Gerrit: Move reviewers.config to modules/gerrit/files/etc/reviewers.config [puppet] - https://gerrit.wikimedia.org/r/413785 (owner: Paladox) [21:17:25] thanks :) [21:19:00] deployed [21:19:11] (no restart) [21:19:21] thanks :) [21:22:34] (PS2) Dzahn: Gerrit: Set plugin.webhooks.sslVerify = true [puppet] - https://gerrit.wikimedia.org/r/413661 (owner: Chad) [21:23:10] (CR) Dzahn: [C: 2] "per "for the pending webhooks plugin we are planning to install so wont affect anything yet"" [puppet] - https://gerrit.wikimedia.org/r/413661 (owner: Chad) [21:27:34] !log demon@tin Finished scap: pos mysql code (duration: 23m 09s) [21:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:48] (PS1) Andrew Bogott: role::striker::web: Update to work on Stretch [puppet] - https://gerrit.wikimedia.org/r/413808 [21:30:28] (PS2) Andrew Bogott: role::striker::web: Update to work on Stretch [puppet] - https://gerrit.wikimedia.org/r/413808 [21:30:37] (PS1) Dzahn: kafkamon: add IPv6 mapped IPs [puppet] - https://gerrit.wikimedia.org/r/413809 (https://phabricator.wikimedia.org/T187805) [21:31:09] (CR) jerkins-bot: [V: -1] kafkamon: add IPv6 mapped IPs [puppet] - https://gerrit.wikimedia.org/r/413809 (https://phabricator.wikimedia.org/T187805) (owner: Dzahn) [21:31:39] (PS1) EBernhardson: Increas bulk insert threadpool for relforge [puppet] - https://gerrit.wikimedia.org/r/413810 [21:32:03] (PS1) Chad: wikidatawiki to wmf.22 (already there, just committing) [mediawiki-config] - https://gerrit.wikimedia.org/r/413811
[21:32:05] (CR) Chad: [C: 2] wikidatawiki to wmf.22 (already there, just committing) [mediawiki-config] - https://gerrit.wikimedia.org/r/413811 (owner: Chad) [21:33:02] (CR) Dzahn: [C: 2] "sorry jerkins-bot, must override until new role is merged" [puppet] - https://gerrit.wikimedia.org/r/413809 (https://phabricator.wikimedia.org/T187805) (owner: Dzahn) [21:33:14] (CR) Dzahn: [V: 2 C: 2] kafkamon: add IPv6 mapped IPs [puppet] - https://gerrit.wikimedia.org/r/413809 (https://phabricator.wikimedia.org/T187805) (owner: Dzahn) [21:33:20] (PS2) Dzahn: kafkamon: add IPv6 mapped IPs [puppet] - https://gerrit.wikimedia.org/r/413809 (https://phabricator.wikimedia.org/T187805) [21:33:32] (Merged) jenkins-bot: wikidatawiki to wmf.22 (already there, just committing) [mediawiki-config] - https://gerrit.wikimedia.org/r/413811 (owner: Chad) [21:33:39] (CR) Andrew Bogott: [C: 2] role::striker::web: Update to work on Stretch [puppet] - https://gerrit.wikimedia.org/r/413808 (owner: Andrew Bogott) [21:33:48] (PS2) EBernhardson: Increas bulk insert threadpool for relforge [puppet] - https://gerrit.wikimedia.org/r/413810 [21:33:51] (CR) jerkins-bot: [V: -1] kafkamon: add IPv6 mapped IPs [puppet] - https://gerrit.wikimedia.org/r/413809 (https://phabricator.wikimedia.org/T187805) (owner: Dzahn) [21:34:03] (PS3) EBernhardson: Increas bulk insert threadpool for relforge [puppet] - https://gerrit.wikimedia.org/r/413810 [21:34:06] (CR) Dzahn: [V: 2 C: 2] kafkamon: add IPv6 mapped IPs [puppet] - https://gerrit.wikimedia.org/r/413809 (https://phabricator.wikimedia.org/T187805) (owner: Dzahn) [21:34:31] (PS3) Dzahn: kafkamon: add IPv6 mapped IPs [puppet] - https://gerrit.wikimedia.org/r/413809 (https://phabricator.wikimedia.org/T187805) [21:35:05] (CR) jerkins-bot: [V: -1] kafkamon: add IPv6 mapped IPs [puppet] - https://gerrit.wikimedia.org/r/413809
(https://phabricator.wikimedia.org/T187805) (owner: Dzahn) [21:36:12] (CR) Dzahn: [V: 2 C: 2] kafkamon: add IPv6 mapped IPs [puppet] - https://gerrit.wikimedia.org/r/413809 (https://phabricator.wikimedia.org/T187805) (owner: Dzahn) [21:37:03] (CR) jenkins-bot: wikidatawiki to wmf.22 (already there, just committing) [mediawiki-config] - https://gerrit.wikimedia.org/r/413811 (owner: Chad) [21:38:04] PROBLEM - ElasticSearch health check for shards on relforge1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.37.21:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.37.21, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7ff969813b90: Failed to establish a new connection: [Errno 111] Connection r [21:38:24] PROBLEM - ElasticSearch health check for shards on relforge1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.4.13:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.4.13, port=9200): Read timed out. (read timeout=4) [21:39:12] ebernhardson: ^ seems related to what you are uploading?
[21:39:27] mutante: yes this is just me experimenting with the cluster [21:39:31] ok :) [21:40:04] RECOVERY - ElasticSearch health check for shards on relforge1002 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 163, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 176, initial [21:40:04] mber_of_data_nodes: 2, delayed_unassigned_shards: 0 [21:40:15] RECOVERY - ElasticSearch health check for shards on relforge1001 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 163, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 176, initial [21:40:15] mber_of_data_nodes: 2, delayed_unassigned_shards: 0 [21:41:59] (PS1) Dzahn: kafkamon: add IPv6 records (WIP) [dns] - https://gerrit.wikimedia.org/r/413813 [21:42:01] (CR) jerkins-bot: [V: -1] kafkamon: add IPv6 records (WIP) [dns] - https://gerrit.wikimedia.org/r/413813 (owner: Dzahn) [21:43:16] (PS4) Paladox: puppetmaster: Use ruby-mysql2 over ruby-mysql and migrate servermon to it [puppet] - https://gerrit.wikimedia.org/r/391336 [21:43:24] (PS5) Paladox: puppetmaster: Use ruby-mysql2 over ruby-mysql and migrate servermon to it [puppet] - https://gerrit.wikimedia.org/r/391336 [21:45:40] (CR) Paladox: "> No, it is still being used by https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/puppetmaster/lib/pupp" [puppet] - https://gerrit.wikimedia.org/r/391336 (owner: Paladox) [21:46:13] (PS6) Paladox: puppetmaster: Use ruby-mysql2 over ruby-mysql and migrate servermon
to it [puppet] - https://gerrit.wikimedia.org/r/391336 (https://phabricator.wikimedia.org/T184562) [21:46:15] (PS1) Madhuvishy: toolschecker: Add the toolschecker word to alert msgs [puppet] - https://gerrit.wikimedia.org/r/413814 [21:46:58] Operations, Puppet, Patch-For-Review: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562#3997712 (Paladox) @fgiunchedi hi, i've migrated servermon.rb to use ruby-mysql2 here https://gerrit.wikimedia.org/r/#/c/391336/ [21:48:18] (CR) Rush: [C: 1] toolschecker: Add the toolschecker word to alert msgs [puppet] - https://gerrit.wikimedia.org/r/413814 (owner: Madhuvishy) [21:48:37] (CR) Madhuvishy: [C: 2] toolschecker: Add the toolschecker word to alert msgs [puppet] - https://gerrit.wikimedia.org/r/413814 (owner: Madhuvishy) [21:54:40] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:10:24] (PS5) Madhuvishy: nfs traffic_shaping: Add labstore1006|7 to tc setup [puppet] - https://gerrit.wikimedia.org/r/413631 [22:11:00] (CR) Madhuvishy: [C: 2] nfs traffic_shaping: Add labstore1006|7 to tc setup [puppet] - https://gerrit.wikimedia.org/r/413631 (owner: Madhuvishy) [22:12:20] ^ andrewbogott ping just because I know striker things are in the mix [22:13:21] huh, I will look [22:13:21] thanks [22:15:13] (PS1) MaxSem: beta: enable mentions in edit summaries [mediawiki-config] - https://gerrit.wikimedia.org/r/413879 (https://phabricator.wikimedia.org/T187835) [22:16:08] (PS1) Andrew Bogott: role::striker::web: change distro behavior [puppet] - https://gerrit.wikimedia.org/r/413880 [22:17:06] (CR) Andrew Bogott: [C: 2] role::striker::web: change distro behavior [puppet] - https://gerrit.wikimedia.org/r/413880 (owner: Andrew Bogott) [22:19:40] RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run
1 minute ago with 0 failures [22:30:07] (Abandoned) Chad: Gerrit: Upgrading gerrit to 2.14.6-pre (DO NOT MERGE) [software/gerrit] - https://gerrit.wikimedia.org/r/363734 (owner: Paladox) [22:30:40] (Abandoned) Chad: gerrit: Ajust scap files (DO NOT MERGE) [software/gerrit] - https://gerrit.wikimedia.org/r/363738 (owner: Paladox) [22:31:03] no_justification branch based work? [22:31:07] ie stable-2.14 [22:31:09] paladox: I redid our branch structure for ops/software/gerrit [22:31:10] and .... [22:31:13] oh [22:31:17] https://gerrit.wikimedia.org/r/#/admin/projects/operations/software/gerrit,branches [22:31:35] So we can do stable-2.15 for testing, stable-2.14 for production, etc. [22:31:41] And HEAD points to current production one [22:31:48] :) [22:33:38] (PS1) Herron: WIP: puppet_compiler: add support for puppetdb4 and local postgresql [puppet] - https://gerrit.wikimedia.org/r/413881 [22:33:55] no_justification nice :) [22:34:07] that will make things easier [22:34:21] (CR) jerkins-bot: [V: -1] WIP: puppet_compiler: add support for puppetdb4 and local postgresql [puppet] - https://gerrit.wikimedia.org/r/413881 (owner: Herron) [22:34:37] and you added the webhooks plugin too :) [22:34:58] (PS1) Chad: Remove +x from webhooks plugin [software/gerrit] (stable-2.14) - https://gerrit.wikimedia.org/r/413882 [22:35:00] (CR) Chad: [C: 2] Remove +x from webhooks plugin [software/gerrit] (stable-2.14) - https://gerrit.wikimedia.org/r/413882 (owner: Chad) [22:35:08] (CR) Chad: [V: 2 C: 2] Remove +x from webhooks plugin [software/gerrit] (stable-2.14) - https://gerrit.wikimedia.org/r/413882 (owner: Chad) [22:35:47] !log demon@tin Started deploy [gerrit/gerrit@010ad50]: no-op, removing permission from file [22:35:57] !log demon@tin Finished deploy [gerrit/gerrit@010ad50]: no-op, removing permission from file (duration: 00m 10s) [22:36:02] paladox: I did that a few days ago [22:36:02] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:10] I meant to mention that to Lego ;-) [22:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:16] no_justification heh [22:36:37] neat :D [22:36:41] no_justification do you merge 2.14 into 2.15? [22:36:48] except...I don't need the plugin anymore due to gitiles :p [22:36:49] until we do the 2.15 war? [22:36:50] I branched 2.15 from the same spot [22:37:10] I thought you needed it for some packagist thing [22:37:14] heh, i guess you could just delete and recreate the branch until we need to do 2.15 testing. [22:37:15] To trigger updates? [22:37:16] Or something [22:37:33] Oh true re: stable-2.15 [22:37:39] Otherwise it'll be outta date [22:37:48] oh, that would be useful for that [22:37:53] it would let us stop depending upon github [22:38:18] yeh [22:39:15] (PS2) Dzahn: kafkamon: add IPv6 records [dns] - https://gerrit.wikimedia.org/r/413813 [22:39:23] (CR) jerkins-bot: [V: -1] kafkamon: add IPv6 records [dns] - https://gerrit.wikimedia.org/r/413813 (owner: Dzahn) [22:39:28] no_justification i guess i should being the 2.15 testing on gerrit-test3 once 2.15 has a s table release. (re notedb too). [22:39:58] (PS3) Dzahn: kafkamon: add IPv6 records [dns] - https://gerrit.wikimedia.org/r/413813 [22:41:26] (PS1) Andrew Bogott: wikitech: grants for the new labswiki db on m5 [puppet] - https://gerrit.wikimedia.org/r/413884 (https://phabricator.wikimedia.org/T188029) [22:41:49] (PS4) Dzahn: kafkamon: add IPv6 records [dns] - https://gerrit.wikimedia.org/r/413813 (https://phabricator.wikimedia.org/T187805) [22:42:34] (CR) Andrew Bogott: "This can't be applied until the users and db exist, of course."
[puppet] - https://gerrit.wikimedia.org/r/413884 (https://phabricator.wikimedia.org/T188029) (owner: Andrew Bogott) [22:43:10] (CR) Dzahn: [C: 2] kafkamon: add IPv6 records [dns] - https://gerrit.wikimedia.org/r/413813 (https://phabricator.wikimedia.org/T187805) (owner: Dzahn) [22:44:10] paladox: We could build rc.3 and various other plugins against stable-2.15 [22:44:14] But tbh I'd rather it go final [22:44:29] no_justification yeh. [22:44:58] no_justification seems to be a few bugs popped up in it [22:45:27] https://bugs.chromium.org/p/gerrit/issues/detail?id=8439 https://bugs.chromium.org/p/gerrit/issues/detail?id=8439 [22:45:41] no_justification https://bugs.chromium.org/p/gerrit/issues/detail?id=8440 was fixed on master but never fixed in 2.15 heh. [22:45:51] (CR) Dzahn: [C: 2] "[kafkamon1001:~] $ ping6 -c1 kafkamon2001.codfw.wmnet" [dns] - https://gerrit.wikimedia.org/r/413813 (https://phabricator.wikimedia.org/T187805) (owner: Dzahn) [22:46:16] no_justification lol, googles ment to be maintaining it in 2.15, but i have probaly backported like most of the fixes / ui changes in to the branch. [22:49:30] (PS1) Chad: Add GO plugin for gerrit [software/gerrit] (stable-2.14) - https://gerrit.wikimedia.org/r/413888 [22:49:46] well i finally got the gpg screen implemented in polygerrit https://gerrit-review.googlesource.com/c/gerrit/+/137871 [22:50:11] no_justification oh what does the go plugin do?
[22:50:14] * paladox has a look [22:50:34] (PS1) Dzahn: network::constants: add kafkamon servers [puppet] - https://gerrit.wikimedia.org/r/413889 (https://phabricator.wikimedia.org/T187805) [22:51:31] paladox: Lets you use `go get` and such [22:51:41] ah i see [22:51:52] thanks [22:52:26] * paladox removes buck from it in https://gerrit-review.googlesource.com/c/plugins/go-import/+/162032 [22:52:58] (CR) Dzahn: "I think it would be nicer if you do this first: https://gerrit.wikimedia.org/r/413889 and then use $KAFKA_MONITORS here like the other v" [puppet] - https://gerrit.wikimedia.org/r/413685 (https://phabricator.wikimedia.org/T187805) (owner: Elukey) [22:53:08] (Abandoned) Chad: Move group2 to wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/413798 (owner: Chad) [22:53:16] (CR) Paladox: [C: 1] Add GO plugin for gerrit [software/gerrit] (stable-2.14) - https://gerrit.wikimedia.org/r/413888 (owner: Chad) [22:53:58] no_justification i will add that plugin to gerrit's review ci, seems to be missing. [22:54:03] (CR) Dzahn: "to be used by https://gerrit.wikimedia.org/r/#/c/413685/" [puppet] - https://gerrit.wikimedia.org/r/413889 (https://phabricator.wikimedia.org/T187805) (owner: Dzahn) [22:56:43] (PS2) Dzahn: DNS: Add mgmt and production entries for wdqs200[4-6] [dns] - https://gerrit.wikimedia.org/r/413791 (owner: Papaul) [22:56:55] (CR) jerkins-bot: [V: -1] DNS: Add mgmt and production entries for wdqs200[4-6] [dns] - https://gerrit.wikimedia.org/r/413791 (owner: Papaul) [22:59:05] Operations, Traffic, ZeroPortal, Patch-For-Review: Cannot fetch Zero carriers/proxies JSON files from eqsin - https://phabricator.wikimedia.org/T188111#3998054 (BBlack) I've tested setting the `HTTPS_PROXY` environment variable before a manual script run from eqsin, causing the request to be prox...
[23:01:07] (PS3) Dzahn: DNS: Add mgmt and production entries for wdqs200[4-6] [dns] - https://gerrit.wikimedia.org/r/413791 (owner: Papaul) [23:12:49] Operations, Traffic, Zero, ZeroPortal, Patch-For-Review: Cannot fetch Zero carriers/proxies JSON files from eqsin - https://phabricator.wikimedia.org/T188111#3998106 (Mholloway) [23:12:58] Operations, Traffic, Zero, ZeroPortal: Cannot fetch Zero carriers/proxies JSON files from eqsin - https://phabricator.wikimedia.org/T188111#3996633 (Mholloway) [23:27:03] (CR) Dzahn: [C: 2] DNS: Add mgmt and production entries for wdqs200[4-6] [dns] - https://gerrit.wikimedia.org/r/413791 (owner: Papaul) [23:29:04] (CR) Dzahn: [C: 2] "[radon:~] $ host wdqs2004.codfw.wmnet" [dns] - https://gerrit.wikimedia.org/r/413791 (owner: Papaul)