[00:09:55] Operations, ops-eqiad, ops-ulsfo, DC-Ops: Audit down ports - https://phabricator.wikimedia.org/T218751 (ayounsi) p:Triage→Low
[00:10:16] Operations, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: Define an SLO for Wikidata Query Service public endpoint and communicate it - https://phabricator.wikimedia.org/T199228 (Smalyshev) Open→Stalled
[00:11:50] Operations, ops-eqiad, ops-ulsfo, DC-Ops: Audit down ports - https://phabricator.wikimedia.org/T218751 (ayounsi) `name=asw2-b-eqiad,lang=diff [edit interfaces xe-7/0/4] + disable; `
[00:12:04] Operations, ops-eqiad, ops-ulsfo, DC-Ops: Audit down ports - https://phabricator.wikimedia.org/T218751 (ayounsi)
[00:14:19] (PS1) Smalyshev: Moving categories dump storage to dumps/ [puppet] - https://gerrit.wikimedia.org/r/497662 (https://phabricator.wikimedia.org/T218457)
[00:16:01] doing one last-minute swat
[00:19:37] thcipriani, https://gerrit.wikimedia.org/r/#/c/497433/ ?
[00:22:05] Operations, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: Socket timeout on wdqs.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T217557 (Smalyshev) Not sure what is to be done for this task. Do we want to investigate what caused it (i.e. which queries, why socket...
[00:25:25] Ebe123: sorry, AndyRussG poked me to get one change out and that's all I have the energy for during this window.
[00:25:38] maybe more than I have the energy for, really :)
[00:25:54] Oh well
[00:35:48] AndyRussG: change is live on mwdebug1002, check please
[00:39:11] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:40:19] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 74801 bytes in 0.622 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:48:21] !log thcipriani@deploy1001 Synchronized php-1.33.0-wmf.21/includes/Title.php: SWAT: [[gerrit:497649|Improve Caching in Title::loadRestrictions()]] (duration: 00m 51s)
[00:48:23] thcipriani@deploy1001: Failed to log message to wiki. Somebody should check the error logs.
[00:48:35] ^ AndyRussG live now
[00:48:58] thcipriani: yay!
[00:50:03] thcipriani: looks good to me!
[00:50:09] AndyRussG: great!
[00:51:59] PROBLEM - puppet last run on aqs1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:53:47] thcipriani: hmmm there happen to be a few errors in logstash that mention the class that changed, right around when it went live: https://logstash.wikimedia.org/goto/5089c5749389a6b4b793dabd275b3d63
[00:53:56] I don't see any more though
[00:54:15] probably just a coincidence
[00:55:16] memcached performance looks fine
[00:56:30] that's good
[00:57:35] thcipriani: yeah again no further errors mentioning Title.php
[00:57:39] so all good I think
[00:57:51] I'll stay nearby in case anything comes up
[00:58:05] thanks so much once again!!!
[00:58:59] sure thing
[01:00:06] (PS1) Alex Monk: [WIP] acme-chief: Add script for Designate integration [puppet] - https://gerrit.wikimedia.org/r/497670 (https://phabricator.wikimedia.org/T206922)
[01:01:37] (CR) jerkins-bot: [V: -1] [WIP] acme-chief: Add script for Designate integration [puppet] - https://gerrit.wikimedia.org/r/497670 (https://phabricator.wikimedia.org/T206922) (owner: Alex Monk)
[01:08:29] (PS1) Bstorm: Revert "wiki replicas: depool labsdb1010 for comment refactor changes" [puppet] - https://gerrit.wikimedia.org/r/497672
[01:17:19] (PS2) Alex Monk: [WIP] acme-chief: Add script for Designate integration [puppet] - https://gerrit.wikimedia.org/r/497670 (https://phabricator.wikimedia.org/T206922)
[01:18:23] RECOVERY - puppet last run on aqs1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:24:38] (CR) Bstorm: [C: +2] Revert "wiki replicas: depool labsdb1010 for comment refactor changes" [puppet] - https://gerrit.wikimedia.org/r/497672 (owner: Bstorm)
[01:33:39] (PS1) Bstorm: wiki replicas: depool labsdb1011 for comment refactor changes [puppet] - https://gerrit.wikimedia.org/r/497674 (https://phabricator.wikimedia.org/T212972)
[01:35:05] (CR) Bstorm: [C: +2] wiki replicas: depool labsdb1011 for comment refactor changes [puppet] - https://gerrit.wikimedia.org/r/497674 (https://phabricator.wikimedia.org/T212972) (owner: Bstorm)
[01:40:31] Operations, MobileFrontend, TechCom-RFC, Traffic, Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (Krinkle) >>! In T214998#5029596, @dbarratt wrote: > Do we know how much of a burden it wi...
[01:43:41] Krinkle, I don't think you meant the IsMobile part in en.wp.o handling
[01:43:54] Krenair: I did. It's a variable.
[01:44:37] implicitly defaulting to false?
[01:46:23] Krenair: Right
[01:47:04] Krenair: I've rephrased slightly by saying "Set to true" instead of "setting"
[01:47:26] sure though I was looking at the non-mobile one
[01:49:49] Wait, we have both ShortUrl and UrlShortener in production?
[01:51:54] oh, one is for title-based and one is for arbitrary urls.
[01:52:00] *face palms*
[01:57:31] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:57:47] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:58:21] PROBLEM - Request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:58:25] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:29:33] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:30:05] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:32:07] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:32:39] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:48:51] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:49:23] RECOVERY - Request latencies on acrab is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:49:27] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:49:47] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:53:16] (PS1) Bstorm: Revert "wiki replicas: depool labsdb1011 for comment refactor changes" [puppet] - https://gerrit.wikimedia.org/r/497678
[02:53:43] (CR) Bstorm: [C: +2] Revert "wiki replicas: depool labsdb1011 for comment refactor changes" [puppet] - https://gerrit.wikimedia.org/r/497678 (owner: Bstorm)
[03:11:08] (PS1) Bstorm: wiki replicas: depool labsdb1009 for views changes [puppet] - https://gerrit.wikimedia.org/r/497679 (https://phabricator.wikimedia.org/T212972)
[03:12:20] (CR) Bstorm: [C: +2] wiki replicas: depool labsdb1009 for views changes [puppet] - https://gerrit.wikimedia.org/r/497679 (https://phabricator.wikimedia.org/T212972) (owner: Bstorm)
[03:19:22] volans: yeah, not sure what was different this time, but CGI was happily serving data (meta-monitoring thought so and I even telnetted to localhost and did some fetches)
[03:26:27] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1
[04:36:19] PROBLEM - puppet last run on puppetmaster2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:39:01] (PS1) Ammarpad: Wikimaniawiki: Enable visual editor in 2019 namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/497682 (https://phabricator.wikimedia.org/T218645)
[04:40:14] (PS1) Bstorm: Revert "wiki replicas: depool labsdb1009 for views changes" [puppet] - https://gerrit.wikimedia.org/r/497683
[04:41:33] (CR) Bstorm: [C: +2] Revert "wiki replicas: depool labsdb1009 for views changes" [puppet] - https://gerrit.wikimedia.org/r/497683 (owner: Bstorm)
[04:56:13] PROBLEM - puppet last run on mw1246 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:02:39] RECOVERY - puppet last run on puppetmaster2002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[05:20:01] (PS1) BryanDavis: openldap: Set default password policy [puppet] - https://gerrit.wikimedia.org/r/497684 (https://phabricator.wikimedia.org/T168692)
[05:27:49] RECOVERY - puppet last run on mw1246 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:07:14] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool hosts in row A" [mediawiki-config] - https://gerrit.wikimedia.org/r/497589 (owner: Jcrespo)
[06:08:24] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool hosts in row A" [mediawiki-config] - https://gerrit.wikimedia.org/r/497589 (owner: Jcrespo)
[06:09:33] (CR) Marostegui: [C: +2] wikireplica_dns.yaml: Depool dbproxy1010 [puppet] - https://gerrit.wikimedia.org/r/497228 (owner: Marostegui)
[06:09:43] (Abandoned) Marostegui: wikireplica_dns.yaml: Depool dbproxy1010 [puppet] - https://gerrit.wikimedia.org/r/497228 (owner: Marostegui)
[06:09:58] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool databases in row A - T187960 (duration: 00m 49s)
[06:10:00] marostegui@deploy1001: Failed to log message to wiki. Somebody should check the error logs.
[06:10:01] T187960: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960
[06:10:04] (Abandoned) Marostegui: db-eqiad.php: Promote db1120 as x1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/496724 (https://phabricator.wikimedia.org/T187960) (owner: Marostegui)
[06:10:27] (Abandoned) Marostegui: mariadb: Promote db1120 as x1 master [puppet] - https://gerrit.wikimedia.org/r/496723 (https://phabricator.wikimedia.org/T187960) (owner: Marostegui)
[06:10:57] (Abandoned) Marostegui: db-eqiad.php: Failover db1066 to db1076 on s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/496721 (https://phabricator.wikimedia.org/T187960) (owner: Marostegui)
[06:11:03] (Abandoned) Marostegui: mariadb: Failover db1066 to db1076 on s2 [puppet] - https://gerrit.wikimedia.org/r/496720 (https://phabricator.wikimedia.org/T187960) (owner: Marostegui)
[06:29:31] PROBLEM - puppet last run on aqs1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh]
[06:55:51] RECOVERY - puppet last run on aqs1005 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[07:20:13] (PS1) Vgutierrez: test [puppet] - https://gerrit.wikimedia.org/r/497691
[07:20:39] (PS1) Marostegui: db-eqiad.php: Depool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/497692
[07:21:44] (PS1) Marostegui: wikireplica_dns.yaml: Depool dbproxy1010 [puppet] - https://gerrit.wikimedia.org/r/497693
[07:22:30] (PS7) Giuseppe Lavagetto: profile::mediawiki::maintenance: systemd-timer based periodic jobs [puppet] - https://gerrit.wikimedia.org/r/482792 (https://phabricator.wikimedia.org/T211250)
[07:31:23] (Abandoned) Vgutierrez: test [puppet] - https://gerrit.wikimedia.org/r/497691 (owner: Vgutierrez)
[07:32:14] !log pool kafka1001 in pybal's eventbus service after yesterday's network maintenance
[07:32:14] elukey: Failed to log message to wiki. Somebody should check the error logs.
[07:58:13] (PS2) Marostegui: db-eqiad.php: Depool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/497692
[08:07:21] (PS1) Mathew.onipe: multi-instance for elastic deployment-prep [puppet] - https://gerrit.wikimedia.org/r/497698 (https://phabricator.wikimedia.org/T213940)
[08:19:45] PROBLEM - MariaDB Slave Lag: s5 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 823.47 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[08:20:12] !log cp2009, cp1071 (cp-ats): reboot for kernel upgrades
[08:20:12] ema: Failed to log message to wiki. Somebody should check the error logs.
[08:20:53] PROBLEM - Host cp2009 is DOWN: PING CRITICAL - Packet loss = 100%
[08:21:39] uh? ah :)
[08:21:57] RECOVERY - Host cp2009 is UP: PING OK - Packet loss = 0%, RTA = 36.14 ms
[08:22:30] vgutierrez: lol
[08:23:05] elukey: I dream about icinga auto acks based on !log entries
[08:24:45] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/497692 (owner: Marostegui)
[08:25:43] (Merged) jenkins-bot: db-eqiad.php: Depool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/497692 (owner: Marostegui)
[08:26:48] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1092 (duration: 00m 48s)
[08:26:48] marostegui@deploy1001: Failed to log message to wiki. Somebody should check the error logs.
[08:26:51] aaahhh, the pleasure of seeing cache hits immediately after reboot. Thank you, ATS for persistent storage!
[08:27:26] ^^
[08:28:27] Operations, ops-eqiad, Cognate, Growth-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (jcrespo) labsdb1009.mgmt (stress on management interface) is down according to icinga for 14 hours (around net maintenance), maybe a loose cable o...
[08:30:28] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - https://gerrit.wikimedia.org/r/497700
[08:31:27] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - https://gerrit.wikimedia.org/r/497700 (owner: Marostegui)
[08:32:31] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - https://gerrit.wikimedia.org/r/497700 (owner: Marostegui)
[08:33:27] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1092 (duration: 00m 48s)
[08:33:27] marostegui@deploy1001: Failed to log message to wiki. Somebody should check the error logs.
[08:38:23] (CR) Marostegui: [C: +2] wikireplica_dns.yaml: Depool dbproxy1010 [puppet] - https://gerrit.wikimedia.org/r/497693 (owner: Marostegui)
[08:40:07] RECOVERY - MariaDB Slave Lag: s5 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 13.65 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[08:42:45] (PS1) Marostegui: db-eqiad.php: Depool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/497701
[08:44:39] (PS1) Elukey: yarn: allow the configuration of maximum app ids retained in HDFS/Zookeeper [puppet/cdh] - https://gerrit.wikimedia.org/r/497702 (https://phabricator.wikimedia.org/T218758)
[08:45:18] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/497701 (owner: Marostegui)
[08:46:27] (Merged) jenkins-bot: db-eqiad.php: Depool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/497701 (owner: Marostegui)
[08:47:29] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1109 (duration: 00m 48s)
[08:47:30] marostegui@deploy1001: Failed to log message to wiki. Somebody should check the error logs.
[08:48:33] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - https://gerrit.wikimedia.org/r/497703
[08:50:50] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - https://gerrit.wikimedia.org/r/497703 (owner: Marostegui)
[08:51:43] (PS2) Elukey: yarn: allow the configuration of maximum app ids retained in HDFS/Zookeeper [puppet/cdh] - https://gerrit.wikimedia.org/r/497702 (https://phabricator.wikimedia.org/T218758)
[08:52:04] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - https://gerrit.wikimedia.org/r/497703 (owner: Marostegui)
[08:53:03] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1109 (duration: 00m 48s)
[08:53:03] marostegui@deploy1001: Failed to log message to wiki. Somebody should check the error logs.
[08:55:03] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/15221/" [puppet/cdh] - https://gerrit.wikimedia.org/r/497702 (https://phabricator.wikimedia.org/T218758) (owner: Elukey)
[08:55:11] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=prometheus1003.eqiad.wmnet
[08:55:12] filippo@puppetmaster1001: Failed to log message to wiki. Somebody should check the error logs.
[08:56:09] (PS1) Marostegui: db-eqiad.php: Depool db1087 [mediawiki-config] - https://gerrit.wikimedia.org/r/497704
[08:57:03] (CR) Gehel: [C: +1] "LGTM, ping me when you want to deploy" [puppet] - https://gerrit.wikimedia.org/r/497662 (https://phabricator.wikimedia.org/T218457) (owner: Smalyshev)
[08:57:54] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1087 [mediawiki-config] - https://gerrit.wikimedia.org/r/497704 (owner: Marostegui)
[08:58:53] (Merged) jenkins-bot: db-eqiad.php: Depool db1087 [mediawiki-config] - https://gerrit.wikimedia.org/r/497704 (owner: Marostegui)
[08:59:52] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1087 (duration: 00m 48s)
[08:59:53] marostegui@deploy1001: Failed to log message to wiki. Somebody should check the error logs.
[09:02:25] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - https://gerrit.wikimedia.org/r/497705
[09:03:30] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - https://gerrit.wikimedia.org/r/497705 (owner: Marostegui)
[09:04:37] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - https://gerrit.wikimedia.org/r/497705 (owner: Marostegui)
[09:06:00] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1087 (duration: 00m 48s)
[09:06:00] marostegui@deploy1001: Failed to log message to wiki. Somebody should check the error logs.
[09:06:31] (PS10) Jcrespo: mariadb-snapshots: Better error and logging handling [puppet] - https://gerrit.wikimedia.org/r/496746 (https://phabricator.wikimedia.org/T210292)
[09:06:33] (PS1) Jcrespo: mariadb-backups: Make some dump options explicit [puppet] - https://gerrit.wikimedia.org/r/497708 (https://phabricator.wikimedia.org/T206203)
[09:09:31] (PS2) Jcrespo: mariadb-backups: Make some dump options explicit [puppet] - https://gerrit.wikimedia.org/r/497708 (https://phabricator.wikimedia.org/T206203)
[09:14:27] (CR) DCausse: [C: -1] elasticsearch: remove from systemd unit (3 comments) [puppet] - https://gerrit.wikimedia.org/r/497503 (https://phabricator.wikimedia.org/T218315) (owner: Mathew.onipe)
[09:18:10] (PS1) Muehlenhoff: Remove unused reset-ldap-password script [puppet] - https://gerrit.wikimedia.org/r/497710
[09:18:25] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1013 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1
[09:19:12] hello kafka
[09:22:31] ah of course the graph is not right
[09:22:40] this is an old cluster still using graphite
[09:23:20] https://grafana.wikimedia.org/d/000000523/kafka-graphite?refresh=5m&panelId=29&fullscreen&orgId=1&from=now-24h&to=now
[09:23:48] this has been ongoing for a while, probably a consequence of yesterday's issue with network maintenance (two kafka hosts down)
[09:25:46] restarting kafka on kafka1013 atm
[09:32:29] !log Deploy schema change on s7 codfw master, lag will appear on codfw
[09:32:29] marostegui: Failed to log message to wiki. Somebody should check the error logs.
[09:32:38] so after that, from kafka topics --describe, it seems that all the ISR sets (basically the 3 nodes in sync with a specific topic partition) are missing 1012 and 1022
[09:32:45] so going to restart 1012 now
[09:34:54] ok 12 is now appearing into ISR sets, good. Going to restart 1022 in a bit
[09:34:58] that should do the trick
[09:36:47] (CR) Gehel: [C: -1] multi-instance for elastic deployment-prep (2 comments) [puppet] - https://gerrit.wikimedia.org/r/497698 (https://phabricator.wikimedia.org/T213940) (owner: Mathew.onipe)
[09:53:53] (PS7) Vgutierrez: acme_chief: Update acme_chief::cert resource to fetch several cert versions [puppet] - https://gerrit.wikimedia.org/r/496148 (https://phabricator.wikimedia.org/T207295)
[09:54:09] (CR) Gehel: [C: -1] elasticsearch: add profile for icinga checks (1 comment) [puppet] - https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: Mathew.onipe)
[09:58:14] (CR) Vgutierrez: "I've refactored this a little bit.. Instead of picking between files or directory, just deploy both of them (there is no reason to choose " [puppet] - https://gerrit.wikimedia.org/r/496148 (https://phabricator.wikimedia.org/T207295) (owner: Vgutierrez)
[10:01:42] (CR) Vgutierrez: [C: +1] "@krenair let me know if you see any flaw in this new approach, but I think it's ready to go" [puppet] - https://gerrit.wikimedia.org/r/496148 (https://phabricator.wikimedia.org/T207295) (owner: Vgutierrez)
[10:11:55] (CR) Vgutierrez: acme_chief: Add security::access::config on passive host if realm == labs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/497430 (owner: Alex Monk)
[10:12:46] !log Reboot dbproxy1010 for upgrade
[10:12:46] marostegui: Failed to log message to wiki. Somebody should check the error logs.
[10:13:21] (PS1) Elukey: confluent::kafka::broker::alerts: fix dashboard links [puppet] - https://gerrit.wikimedia.org/r/497711
[10:14:22] (CR) Elukey: [C: +2] confluent::kafka::broker::alerts: fix dashboard links [puppet] - https://gerrit.wikimedia.org/r/497711 (owner: Elukey)
[10:15:17] ok the wrong graphs for kafka should be fixed
[10:15:31] and icinga should recover soon
[10:17:26] (PS1) Marostegui: Revert "wikireplica_dns.yaml: Depool dbproxy1010" [puppet] - https://gerrit.wikimedia.org/r/497712
[10:17:41] (PS2) Marostegui: Revert "wikireplica_dns.yaml: Depool dbproxy1010" [puppet] - https://gerrit.wikimedia.org/r/497712
[10:18:27] (CR) Marostegui: [C: +2] Revert "wikireplica_dns.yaml: Depool dbproxy1010" [puppet] - https://gerrit.wikimedia.org/r/497712 (owner: Marostegui)
[10:19:21] PROBLEM - puppet last run on mw1281 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:20:45] !log Repool dbproxy1010 and running wmcs-wikireplica-dns script
[10:20:45] marostegui: Failed to log message to wiki. Somebody should check the error logs.
[10:22:35] (CR) Alexandros Kosiaris: [C: +2] "Good point. Let's do INFO for now in order to have quite verbose information and work from there. Thanks" [puppet] - https://gerrit.wikimedia.org/r/497561 (owner: Alexandros Kosiaris)
[10:22:43] (PS2) Alexandros Kosiaris: gerrit: enable httpd request log [puppet] - https://gerrit.wikimedia.org/r/497561
[10:23:18] (PS27) Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921)
[10:23:27] (CR) Mathew.onipe: elasticsearch: add profile for icinga checks (1 comment) [puppet] - https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: Mathew.onipe)
[10:24:30] (PS1) Vgutierrez: CI: Run tests with minimum and latest dependencies [software/acme-chief] - https://gerrit.wikimedia.org/r/497715 (https://phabricator.wikimedia.org/T213820)
[10:25:07] (Abandoned) Vgutierrez: CI: Run tests with minimum and latest dependencies [software/certcentral] - https://gerrit.wikimedia.org/r/485017 (https://phabricator.wikimedia.org/T213820) (owner: Vgutierrez)
[10:26:57] !log reimage prometheus1003 with stretch - T205870
[10:26:59] godog: Failed to log message to wiki. Somebody should check the error logs.
[10:27:00] T205870: Provision >= 50% of statsd/Graphite-only metrics in Prometheus - https://phabricator.wikimedia.org/T205870
[10:27:20] (CR) Mathew.onipe: "PCC is Ok and expected: https://puppet-compiler.wmflabs.org/compiler1002/15223/" [puppet] - https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: Mathew.onipe)
[10:28:01] !log restart gerrit for merge of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/497561/
[10:28:01] akosiaris: Failed to log message to wiki. Somebody should check the error logs.
[10:34:29] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts]
[10:35:07] PROBLEM - puppet last run on eventlog1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[10:37:11] PROBLEM - puppet last run on vega is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 7 minutes ago with 8 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikibase/wikiba.se-deploy]
[10:43:25] (PS1) Hashar: Revert "gerrit: enable httpd request log" [puppet] - https://gerrit.wikimedia.org/r/497726
[10:43:27] (PS1) Hashar: gerrit: add User in http response for logging [puppet] - https://gerrit.wikimedia.org/r/497727
[10:50:08] RECOVERY - puppet last run on mw1281 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[10:50:21] (CR) Gehel: [C: -1] elasticsearch: add profile for icinga checks (1 comment) [puppet] - https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: Mathew.onipe)
[10:52:39] (PS17) Mathew.onipe: cloudelastic: Add cloudelastic configs [puppet] - https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921)
[10:52:41] (PS28) Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921)
[10:53:41] (CR) jerkins-bot: [V: -1] cloudelastic: Add cloudelastic configs [puppet] - https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: Mathew.onipe)
[10:54:50] (Abandoned) Hashar: Revert "gerrit: enable httpd request log" [puppet] - https://gerrit.wikimedia.org/r/497726 (owner: Hashar)
[10:54:55] (Abandoned) Hashar: gerrit: add User in http response for logging [puppet] - https://gerrit.wikimedia.org/r/497727 (owner: Hashar)
[10:56:11] (PS1) Filippo Giunchedi: mediawiki: move logging pipeline rsyslog shim from role to profile [puppet] - https://gerrit.wikimedia.org/r/497729 (https://phabricator.wikimedia.org/T218764)
[10:59:22] (CR) Dzahn: [C: +1] Remove unused reset-ldap-password script [puppet] - https://gerrit.wikimedia.org/r/497710 (owner: Muehlenhoff)
[10:59:52] (CR) Filippo Giunchedi: [C: +1] "PCC https://puppet-compiler.wmflabs.org/compiler1002/15224/" [puppet] - https://gerrit.wikimedia.org/r/497729 (https://phabricator.wikimedia.org/T218764) (owner: Filippo Giunchedi)
[10:59:54] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190320T1100).
[11:00:04] alaa_wmde, Ebe123, jan_drewniak, and Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:16] o/
[11:00:22] RECOVERY - puppet last run on eventlog1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:00:26] \o
[11:00:28] you can do mine and alaa_wmde a little bit later if it's fine
[11:00:59] o/
[11:02:00] Here for SWAT; https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/497433
[11:02:40] RECOVERY - puppet last run on vega is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[11:03:31] !log restart gerrit for testing https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/497727/
[11:03:32] akosiaris: Failed to log message to wiki. Somebody should check the error logs.
[11:03:49] I'm around but I would prefer not to swat today. can somebody else swat?
[11:04:27] oh, gerrit is down. well, in that case there's no swat anyway :P
[11:04:42] zeljkof: I think it is just restarting for testing
[11:05:10] "restarting for testing"?
[11:05:27] (CR) Arturo Borrero Gonzalez: [C: +1] "After reading https://phabricator.wikimedia.org/T218764#5039572 I agree with this." [puppet] - https://gerrit.wikimedia.org/r/497729 (https://phabricator.wikimedia.org/T218764) (owner: Filippo Giunchedi)
[11:06:04] ok, gerrit is back. looks like it was just rebooted
[11:08:26] (PS3) Filippo Giunchedi: prometheus: don't require Prometheus::Server when writing k8s token [puppet] - https://gerrit.wikimedia.org/r/490834 (https://phabricator.wikimedia.org/T187987)
[11:08:29] (PS1) Filippo Giunchedi: hieradata: run prometheus 2 on prometheus1003 [puppet] - https://gerrit.wikimedia.org/r/497746 (https://phabricator.wikimedia.org/T187987)
[11:08:37] (CR) Hashar: [C: +1] "Looks legit :) Thank you!" [puppet] - https://gerrit.wikimedia.org/r/497729 (https://phabricator.wikimedia.org/T218764) (owner: Filippo Giunchedi)
[11:08:48] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts]
[11:09:16] (PS2) Muehlenhoff: Remove unused reset-ldap-password script [puppet] - https://gerrit.wikimedia.org/r/497710
[11:09:48] (CR) Hashar: [C: +1] "Might want to add Hosts: header in the commit message that points to labtestweb / labweb1001 and labweb1002 and run the puppet compiler to" [puppet] - https://gerrit.wikimedia.org/r/497729 (https://phabricator.wikimedia.org/T218764) (owner: Filippo Giunchedi)
[11:10:13] zeljkof: okay, I swat
[11:10:26] Amir1: thanks :)
[11:11:25] (PS2) Ladsgroup: Increased maxSerializedEntitySize from 2500 to 3000 [mediawiki-config] - https://gerrit.wikimedia.org/r/496161 (https://phabricator.wikimedia.org/T217739) (owner: Mahveotm)
[11:11:36] (CR) Muehlenhoff: [C: +2] Remove unused reset-ldap-password script [puppet] - https://gerrit.wikimedia.org/r/497710 (owner: Muehlenhoff)
[11:11:38] (CR) Ladsgroup: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/496161 (https://phabricator.wikimedia.org/T217739) (owner: Mahveotm)
[11:12:34] (Merged) jenkins-bot: Increased maxSerializedEntitySize from 2500 to 3000 [mediawiki-config] - https://gerrit.wikimedia.org/r/496161 (https://phabricator.wikimedia.org/T217739) (owner: Mahveotm)
[11:14:41] alaa_wmde: yours is going live
[11:15:45] (PS6) Ladsgroup: Partially revert "Enable musical notation datatype in wikidata" [mediawiki-config] - https://gerrit.wikimedia.org/r/497433 (https://phabricator.wikimedia.org/T218535) (owner: Ebe123)
[11:16:13] (CR) Ladsgroup: "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/497433 (https://phabricator.wikimedia.org/T218535) (owner: Ebe123)
[11:16:15] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:496161|Increased maxSerializedEntitySize from 2500 to 3000 (T217739)]] (duration: 01m 47s)
[11:16:17] ladsgroup@deploy1001: Failed to log message to wiki. Somebody should check the error logs.
[11:16:18] T217739: Set maxSerializedEntitySize to 3000 - https://phabricator.wikimedia.org/T217739
[11:16:42] (PS1) Alexandros Kosiaris: Revert "gerrit: enable httpd request log" [puppet] - https://gerrit.wikimedia.org/r/497747
[11:16:44] (PS1) Alexandros Kosiaris: gerrit: add User in http response for logging [puppet] - https://gerrit.wikimedia.org/r/497748
[11:17:09] (CR) Filippo Giunchedi: [C: +1] "Another PCC run on more hosts (from cumin 'R:class = profile::mediawiki::common')" [puppet] - https://gerrit.wikimedia.org/r/497729 (https://phabricator.wikimedia.org/T218764) (owner: Filippo Giunchedi)
[11:17:15] (CR) Ladsgroup: [C: +2] Partially revert "Enable musical notation datatype in wikidata" [mediawiki-config] - https://gerrit.wikimedia.org/r/497433 (https://phabricator.wikimedia.org/T218535) (owner: Ebe123)
[11:17:51] (CR) Alexandros Kosiaris: [C: +2] Revert "gerrit: enable httpd request log" [puppet] - https://gerrit.wikimedia.org/r/497747 (owner: Alexandros Kosiaris)
[11:18:13] (CR) Alexandros Kosiaris: [C: +2] gerrit: add User in http response for logging [puppet] - https://gerrit.wikimedia.org/r/497748 (owner: Alexandros Kosiaris)
[11:18:22] Ebe123: yours is going live
[11:18:25] (Merged) jenkins-bot: Partially revert "Enable musical notation datatype in wikidata" [mediawiki-config] - https://gerrit.wikimedia.org/r/497433 (https://phabricator.wikimedia.org/T218535) (owner: Ebe123)
[11:19:17] (PS2) Filippo Giunchedi: mediawiki: move logging pipeline rsyslog shim from role to profile [puppet] - https://gerrit.wikimedia.org/r/497729 (https://phabricator.wikimedia.org/T218764)
[11:19:29] alaa_wmde Ebe123 the musical notation patch is live in mwdebug1002
[11:19:40] test and then tell me if it's okay to go live
[11:19:42] @Amir1 checking now
[11:20:03] Amir1: please LMK one the train is done, I'll merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/497729 then
[11:20:08] s/one/once/
[11:20:18] godog: sure
[11:23:18] thanks!
[11:23:28] (PS1) Alexandros Kosiaris: Followup for e5c490f6536c25. Move to httpd module [puppet] - https://gerrit.wikimedia.org/r/497750
[11:23:45] Operations, netops, Patch-For-Review: IGMP snooping breaks IPv6 ND on Junos 14.1X53-D46 - https://phabricator.wikimedia.org/T201039 (jcrespo) Just to give an idea followup of es1014, issue seem gone: ` jynus@prometheus1004:~$ ping es1014.eqiad.wmnet PING es1014.eqiad.wmnet (10.64.16.187) 56(84) byt...
[11:24:12] Operations, Puppet, puppet-compiler, Release-Engineering-Team (Watching / External): Integrate the puppet compiler in the puppet CI pipeline - https://phabricator.wikimedia.org/T166066 (hashar) We have a Jenkins job T97513 which has been made to recognizes `Hosts:` in commit message to passe a li...
[11:24:33] (CR) Alexandros Kosiaris: [C: +2] Followup for e5c490f6536c25. Move to httpd module [puppet] - https://gerrit.wikimedia.org/r/497750 (owner: Alexandros Kosiaris)
[11:24:54] checking, Amir1
[11:27:15] (CR) Hashar: [C: +1] "So seems deploy/snapshot hosts do no have any proper logging emitted to logstash ? :(" [puppet] - https://gerrit.wikimedia.org/r/497729 (https://phabricator.wikimedia.org/T218764) (owner: Filippo Giunchedi)
[11:27:27] I checked musical notation on wikidata and it's good
[11:27:39] (PS3) Ladsgroup: Enable Advanced Mobile Contributions mode for ar,id,es and test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/497500 (owner: Pmiazga)
[11:27:47] Works for me
[11:27:50] going live then
[11:27:55] seems to work when previewing on wikisource as mentioned in the ticket, but will wait for @Ebe123
[11:27:59] awesome!
[11:28:00] jan_drewniak: you're next [11:28:07] alaa_wmde, Wikidata was never affected [11:28:13] Amir1: can mine be put on mwdebug before going live [11:28:24] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497500 (owner: 10Pmiazga) [11:28:30] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:497433|Partially revert "Enable musical notation datatype in wikidata" (T218535)]] (duration: 00m 50s) [11:28:32] ladsgroup@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [11:28:33] T218535: Score extension now leaving large amounts of space on rendering - https://phabricator.wikimedia.org/T218535 [11:28:36] jan_drewniak: sure [11:28:40] @Ebe123 yup just wanted to make sure it is still not functioning fine (although not expected to break) [11:28:53] still functioning* [11:29:23] (03Merged) 10jenkins-bot: Enable Advanced Mobile Contributions mode for ar,id,es and test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497500 (owner: 10Pmiazga) [11:30:01] jan_drewniak: live in mwdebug1002 [11:33:13] Amir1: looks good! [11:33:30] akosiaris: there’s a syntax error in https://gerrit.wikimedia.org/r/c/operations/puppet/+/497750/1/modules/httpd/files/defaults.conf#29 [11:33:40] (See line at the bottom) [11:33:45] I know [11:33:47] fixing [11:34:02] !log disable puppet across fleet to avoid alert spam storm [11:34:02] akosiaris: Failed to log message to wiki. Somebody should check the error logs. [11:34:03] jan_drewniak: okie dokie, going live [11:34:09] Ok [11:34:20] akosiaris: should I stop SWAT? [11:34:31] Amir1: no, no need [11:34:38] puppet mistake, feel free to proceed [11:34:45] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:497500|Enable Advanced Mobile Contributions mode for ar,id,es and test wikis (T217643)]] (duration: 00m 50s) [11:34:47] ladsgroup@deploy1001: Failed to log message to wiki. 
Somebody should check the error logs. [11:34:48] T217643: Deploy AMC pt 1 to Arabic, Indonesian, and Spanish Wikipedias - https://phabricator.wikimedia.org/T217643 [11:35:13] paladox: thanks! [11:35:35] seems like sleeplessness is starting to have an effect [11:36:02] Oh and your welcome :) [11:36:07] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497579 (https://phabricator.wikimedia.org/T217730) (owner: 10Ladsgroup) [11:36:16] (03PS1) 10Alexandros Kosiaris: Fix typo in apache/httpd defaults.conf [puppet] - 10https://gerrit.wikimedia.org/r/497751 [11:37:13] (03Merged) 10jenkins-bot: Add wikimaniawiki to wikidataclient.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497579 (https://phabricator.wikimedia.org/T217730) (owner: 10Ladsgroup) [11:37:41] (03CR) 10Alexandros Kosiaris: [C: 03+2] Fix typo in apache/httpd defaults.conf [puppet] - 10https://gerrit.wikimedia.org/r/497751 (owner: 10Alexandros Kosiaris) [11:40:24] Amir1: thanks! [11:40:55] !log ladsgroup@deploy1001 Synchronized dblists/wikidataclient.dblist: SWAT: [[gerrit:497579|Add wikimaniawiki to wikidataclient.dblist (T217730)]] (duration: 00m 50s) [11:40:56] (03PS2) 10Ladsgroup: Add wikimania as a special group to wikidata sitelinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497578 (https://phabricator.wikimedia.org/T217730) [11:40:56] ladsgroup@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
[11:40:57] T217730: Connect wikimaniawiki to Wikidata - https://phabricator.wikimedia.org/T217730 [11:41:06] (03CR) 10Ladsgroup: [C: 03+2] Add wikimania as a special group to wikidata sitelinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497578 (https://phabricator.wikimedia.org/T217730) (owner: 10Ladsgroup) [11:41:18] thanks @Amir1 for doing the swat [11:41:41] you are very welcome [11:42:12] (03Merged) 10jenkins-bot: Add wikimania as a special group to wikidata sitelinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497578 (https://phabricator.wikimedia.org/T217730) (owner: 10Ladsgroup) [11:44:57] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:497578|Add wikimania as a special group to wikidata sitelinks (T217730)]] (duration: 00m 50s) [11:44:59] ladsgroup@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [11:45:10] !log EU SWAT is done [11:45:10] Amir1: Failed to log message to wiki. Somebody should check the error logs. [11:45:13] cc zeljkof [11:51:56] !log re-enable puppet across fleet [11:51:57] akosiaris: Failed to log message to wiki. Somebody should check the error logs. [11:54:39] thanks Amir1 [11:54:39] (03PS4) 10Jbond: Add component/ci wikimedia repository to CI hosts [puppet] - 10https://gerrit.wikimedia.org/r/495681 (https://phabricator.wikimedia.org/T212774) [11:55:23] PROBLEM - puppet last run on ms-be1043 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[parted-/dev/sdk] [11:55:25] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [11:57:17] PROBLEM - puppet last run on cloudnet2001-dev is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[apt-wikimedia-sources],Exec[apt-debian-backports-sources] [11:58:40] (03CR) 10Jbond: [C: 03+2] Add component/ci wikimedia repository to CI hosts [puppet] - 10https://gerrit.wikimedia.org/r/495681 (https://phabricator.wikimedia.org/T212774) (owner: 10Jbond) [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190320T1200) [12:01:09] 10Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Upgrade jenkins-debian-glue to v0.20.0 - https://phabricator.wikimedia.org/T212774 (10jbond) patch is merged, let me know if there is anything elses from my side [12:05:29] PROBLEM - puppet last run on mw1312 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean] [12:22:55] (03PS1) 10Muehlenhoff: Add --disable-user option to offboard script [puppet] - 10https://gerrit.wikimedia.org/r/497758 [12:23:40] (03CR) 10jerkins-bot: [V: 04-1] Add --disable-user option to offboard script [puppet] - 10https://gerrit.wikimedia.org/r/497758 (owner: 10Muehlenhoff) [12:24:05] (03PS3) 10Filippo Giunchedi: mediawiki: move logging pipeline rsyslog shim from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/497729 (https://phabricator.wikimedia.org/T218764) [12:26:48] (03CR) 10Filippo Giunchedi: [C: 03+2] mediawiki: move logging pipeline rsyslog shim from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/497729 (https://phabricator.wikimedia.org/T218764) (owner: 10Filippo Giunchedi) [12:31:45] RECOVERY - puppet last run on mw1312 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:35:44] (03CR) 10Alexandros Kosiaris: eventgate-analytics - adjustments to statsd exporter matches (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/497612 
(https://phabricator.wikimedia.org/T218305) (owner: 10Ottomata) [12:37:52] (03PS2) 10Muehlenhoff: Add --disable-user option to offboard script [puppet] - 10https://gerrit.wikimedia.org/r/497758 [12:38:37] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics - adjustments to statsd exporter matches (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/497612 (https://phabricator.wikimedia.org/T218305) (owner: 10Ottomata) [12:45:05] (03CR) 10Filippo Giunchedi: eventgate-analytics - adjustments to statsd exporter matches (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/497612 (https://phabricator.wikimedia.org/T218305) (owner: 10Ottomata) [12:46:56] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics - adjustments to statsd exporter matches (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/497612 (https://phabricator.wikimedia.org/T218305) (owner: 10Ottomata) [12:48:37] (03PS2) 10Filippo Giunchedi: hieradata: run prometheus 2 on prometheus1003 [puppet] - 10https://gerrit.wikimedia.org/r/497746 (https://phabricator.wikimedia.org/T187987) [12:48:45] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: run prometheus 2 on prometheus1003 [puppet] - 10https://gerrit.wikimedia.org/r/497746 (https://phabricator.wikimedia.org/T187987) (owner: 10Filippo Giunchedi) [12:55:51] (03PS4) 10Filippo Giunchedi: prometheus: don't require Prometheus::Server when writing k8s token [puppet] - 10https://gerrit.wikimedia.org/r/490834 (https://phabricator.wikimedia.org/T187987) [12:55:53] (03PS1) 10Filippo Giunchedi: prometheus: use yaml rules for prometheus v2 k8s-staging [puppet] - 10https://gerrit.wikimedia.org/r/497761 (https://phabricator.wikimedia.org/T187987) [12:57:16] (03PS1) 10GTirloni: labstore: Increase nfs-exportd interval from 60 to 300s [puppet] - 10https://gerrit.wikimedia.org/r/497762 (https://phabricator.wikimedia.org/T217086) [12:57:37] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: use yaml rules for prometheus 
v2 k8s-staging [puppet] - 10https://gerrit.wikimedia.org/r/497761 (https://phabricator.wikimedia.org/T187987) (owner: 10Filippo Giunchedi) [13:00:04] zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - European version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190320T1300). [13:04:58] train is blocked on security patch [13:05:18] Did we ever have a train yesterday? [13:05:36] (US) [13:09:40] (03PS1) 10Ottomata: eventgate-analytics - remove confusing '_histogram' suffix from summary quantiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/497763 (https://phabricator.wikimedia.org/T218305) [13:10:50] Zppix: no [13:10:57] I was able to cut the branch, that's all [13:11:15] as soon as the security patch is fixed, I'll start the train [13:11:36] this week it's in EU time, but some things might happen in US window, depending on how it goes [13:12:04] zeljkof: so the train is behind (so the groups will be deployed a day behind) [13:12:20] ACKNOWLEDGEMENT - HP RAID on db2052 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:10 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T218776 [13:12:32] 10Operations, 10ops-codfw: Degraded RAID on db2052 - https://phabricator.wikimedia.org/T218776 (10ops-monitoring-bot) [13:12:56] Zppix: we might catch up today to group 0 and 1, depending if there are any problems [13:13:09] zeljkof: okay good to know thanks [13:13:31] (03PS1) 10Gehel: Expose failed results as part of RemoteExecutionError. [software/spicerack] - 10https://gerrit.wikimedia.org/r/497764 [13:17:37] (03CR) 10jerkins-bot: [V: 04-1] Expose failed results as part of RemoteExecutionError.
[software/spicerack] - 10https://gerrit.wikimedia.org/r/497764 (owner: 10Gehel) [13:19:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, but let's wait or Brooke to review this as well." [puppet] - 10https://gerrit.wikimedia.org/r/497762 (https://phabricator.wikimedia.org/T217086) (owner: 10GTirloni) [13:19:21] train continues [13:19:46] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics - remove confusing '_histogram' suffix from summary quantiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/497763 (https://phabricator.wikimedia.org/T218305) (owner: 10Ottomata) [13:29:48] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [13:29:49] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [13:29:51] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [13:29:51] !log otto@deploy1001 scap-helm eventgate-analytics finished [13:29:51] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [13:29:51] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [13:34:03] (03PS1) 10Zfilipin: Group0 to 1.33.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497765 [13:37:01] !log zfilipin@deploy1001 clean aborted: Pruned MediaWiki: 1.33.0-wmf.17 (duration: 00m 08s) [13:37:02] zfilipin@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
[13:42:11] (03PS1) 10Jbond: Move qualified parameters to there correct location [puppet] - 10https://gerrit.wikimedia.org/r/497767 [13:42:49] (03CR) 10jerkins-bot: [V: 04-1] Move qualified parameters to there correct location [puppet] - 10https://gerrit.wikimedia.org/r/497767 (owner: 10Jbond) [13:43:01] (03PS3) 10Arturo Borrero Gonzalez: openstack: add keystone support for mitaka/stretch in cloudcontrol servers [puppet] - 10https://gerrit.wikimedia.org/r/489275 (https://phabricator.wikimedia.org/T215407) [13:43:53] (03PS2) 10Jbond: Move qualified parameters to there correct location [puppet] - 10https://gerrit.wikimedia.org/r/497767 [13:43:56] (03CR) 10jerkins-bot: [V: 04-1] openstack: add keystone support for mitaka/stretch in cloudcontrol servers [puppet] - 10https://gerrit.wikimedia.org/r/489275 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez) [13:44:05] !log zfilipin@deploy1001 clean aborted: Pruned MediaWiki: 1.33.0-wmf.17 (duration: 00m 05s) [13:44:05] zfilipin@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [13:44:30] (03CR) 10jerkins-bot: [V: 04-1] Move qualified parameters to there correct location [puppet] - 10https://gerrit.wikimedia.org/r/497767 (owner: 10Jbond) [13:46:28] (03PS3) 10Jbond: Move qualified parameters to there correct location [puppet] - 10https://gerrit.wikimedia.org/r/497767 [13:47:08] (03CR) 10jerkins-bot: [V: 04-1] Move qualified parameters to there correct location [puppet] - 10https://gerrit.wikimedia.org/r/497767 (owner: 10Jbond) [13:48:54] !log take a snapshot of prometheus data on prometheus1004 [13:48:54] godog: Failed to log message to wiki. Somebody should check the error logs. [13:49:01] that's going to cause some unknowns in icinga btw [13:49:58] (03PS1) 10Gehel: Hacking some logging to investigate apt-get failure. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/497768 [13:51:57] (03PS4) 10Arturo Borrero Gonzalez: openstack: add keystone support for mitaka/stretch in cloudcontrol servers [puppet] - 10https://gerrit.wikimedia.org/r/489275 (https://phabricator.wikimedia.org/T215407) [13:53:02] (03CR) 10jerkins-bot: [V: 04-1] openstack: add keystone support for mitaka/stretch in cloudcontrol servers [puppet] - 10https://gerrit.wikimedia.org/r/489275 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez) [13:55:25] (03PS4) 10Jbond: Move qualified parameters to there correct location [puppet] - 10https://gerrit.wikimedia.org/r/497767 [13:58:12] (03PS1) 10Alexandros Kosiaris: Switchover oresrdb.svc.codfw.wmnet for kernel upgrades [dns] - 10https://gerrit.wikimedia.org/r/497772 [13:58:31] (03PS5) 10Arturo Borrero Gonzalez: openstack: add keystone support for mitaka/stretch in cloudcontrol servers [puppet] - 10https://gerrit.wikimedia.org/r/489275 (https://phabricator.wikimedia.org/T215407) [14:01:07] (03PS6) 10Arturo Borrero Gonzalez: openstack: add keystone support for mitaka/stretch in cloudcontrol servers [puppet] - 10https://gerrit.wikimedia.org/r/489275 (https://phabricator.wikimedia.org/T215407) [14:02:24] !log rebooting oresrdb2002 for kernel update [14:02:25] moritzm: Failed to log message to wiki. Somebody should check the error logs. 
[14:02:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1002/15229/" [puppet] - 10https://gerrit.wikimedia.org/r/489275 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez) [14:03:10] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/497767 (owner: 10Jbond) [14:05:20] (03PS5) 10Jbond: Move qualified parameters to there correct location [puppet] - 10https://gerrit.wikimedia.org/r/497767 [14:05:59] (03CR) 10jerkins-bot: [V: 04-1] Move qualified parameters to there correct location [puppet] - 10https://gerrit.wikimedia.org/r/497767 (owner: 10Jbond) [14:07:19] (03CR) 10Alexandros Kosiaris: [C: 03+2] Switchover oresrdb.svc.codfw.wmnet for kernel upgrades [dns] - 10https://gerrit.wikimedia.org/r/497772 (owner: 10Alexandros Kosiaris) [14:15:18] (03PS2) 10Gehel: Hacking some logging to investigate apt-get failure. [cookbooks] - 10https://gerrit.wikimedia.org/r/497768 [14:15:37] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:16:42] (03CR) 10Volans: [C: 03+1] "LGTM for the purpose, let's not forget to use those cases to improve spicerack and cumin API ;)" [cookbooks] - 10https://gerrit.wikimedia.org/r/497768 (owner: 10Gehel) [14:17:11] 10Operations, 10MediaWiki-extensions-PdfHandler, 10Multimedia: Error creating PDF on Commons: "convert: no decode delegate for this image format" (fixed in GS 9.07) - https://phabricator.wikimedia.org/T50007 (10Seb35) FYI I have this error message on some PDF scans (on a private wiki). The issue is with Ghos... [14:17:24] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2052 - https://phabricator.wikimedia.org/T218776 (10Marostegui) p:05Triage→03Normal a:03Papaul Can get this replaced? Thanks! 
[14:17:39] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [14:18:05] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [14:21:08] the availability alerts are due to ores in codfw afaics [14:21:33] (03CR) 10Gehel: [C: 03+2] Hacking some logging to investigate apt-get failure. [cookbooks] - 10https://gerrit.wikimedia.org/r/497768 (owner: 10Gehel) [14:22:36] akosiaris: I guess known/expected (re: ores codfw causing 500s and 503s) [14:23:15] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:24:01] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [14:26:45] Amir1 and I are looking at this too. [14:26:47] godog, ^ [14:26:51] We don't know what it could be. [14:26:58] Looks like ORES is not processing any requests in CODFW. [14:27:23] this was preceded by a huge spike in CPU activity on our secondary (mirror) redis node. [14:27:35] The primary redis node had no such spike. [14:28:15] halfak: ack, thanks! yeah I asked because I saw a switchover redis patch flying by [14:28:32] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the patch. Prospector is complaining about the unused results, I guess we might need a test to make it happy :)" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/497764 (owner: 10Gehel) [14:28:44] I don't know anything about that.
[14:28:46] Hmm [14:32:39] (03PS29) 10Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) [14:33:38] We did get a sudden burst of requests starting 01:20 UTC. We handled them just fine until very recently. [14:33:48] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [14:35:55] !log zfilipin@deploy1001 clean aborted: Pruned MediaWiki: 1.33.0-wmf.17 (duration: 00m 03s) [14:35:55] zfilipin@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [14:40:40] (03PS1) 10Alexandros Kosiaris: Revert "Switchover oresrdb.svc.codfw.wmnet for kernel upgrades" [dns] - 10https://gerrit.wikimedia.org/r/497779 [14:40:46] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Revert "Switchover oresrdb.svc.codfw.wmnet for kernel upgrades" [dns] - 10https://gerrit.wikimedia.org/r/497779 (owner: 10Alexandros Kosiaris) [14:42:17] (03PS6) 10Jbond: Move qualified parameters to there correct location [puppet] - 10https://gerrit.wikimedia.org/r/497767 [14:42:53] (03CR) 10jerkins-bot: [V: 04-1] Move qualified parameters to there correct location [puppet] - 10https://gerrit.wikimedia.org/r/497767 (owner: 10Jbond) [14:43:14] (03CR) 10Volans: [C: 03+1] "LGTM, couple of nitpicks inline." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/497767 (owner: 10Jbond) [14:44:58] 10Operations, 10Analytics, 10Discovery, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10fgiunchedi) >>! In T213976#5032956, @Nuria wrote: > @fgiunchedi sounds good, will try to set up short 30 min meetin... 
[14:48:13] (03PS1) 10Marostegui: db-eqiad.php: Depool db1098:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497780 [14:49:29] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1098:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497780 (owner: 10Marostegui) [14:50:39] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1098:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497780 (owner: 10Marostegui) [14:51:52] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1098:3317 (duration: 00m 56s) [14:51:52] marostegui@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [14:52:08] (03PS30) 10Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) [14:53:07] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [14:55:27] (03PS1) 10Hashar: scap: add logging to clean > prune-git-branches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497781 [15:02:01] (03PS1) 10Brian Wolff: Adjust wikitech account settings. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497782 (https://phabricator.wikimedia.org/T218654) [15:03:04] (03CR) 10jerkins-bot: [V: 04-1] Adjust wikitech account settings. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497782 (https://phabricator.wikimedia.org/T218654) (owner: 10Brian Wolff) [15:04:52] (03PS7) 10Jbond: Move qualified parameters to there correct location [puppet] - 10https://gerrit.wikimedia.org/r/497767 [15:05:30] (03CR) 10jerkins-bot: [V: 04-1] Move qualified parameters to there correct location [puppet] - 10https://gerrit.wikimedia.org/r/497767 (owner: 10Jbond) [15:06:58] (03PS2) 10Brian Wolff: Adjust wikitech account settings. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/497782 (https://phabricator.wikimedia.org/T218654) [15:08:26] (03PS8) 10Jbond: Move qualified parameters to there correct location [puppet] - 10https://gerrit.wikimedia.org/r/497767 [15:08:55] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.e6-upgrade [15:08:55] gehel@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [15:09:03] !log gehel@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.e6-upgrade (exit_code=97) [15:09:03] gehel@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [15:09:15] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.e6-upgrade [15:09:15] gehel@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [15:09:19] (03CR) 10jerkins-bot: [V: 04-1] Move qualified parameters to there correct location [puppet] - 10https://gerrit.wikimedia.org/r/497767 (owner: 10Jbond) [15:10:42] !log gehel@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.e6-upgrade (exit_code=99) [15:10:42] gehel@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [15:11:32] (03CR) 10Volans: "Few comments inline (post-merge)" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/496836 (https://phabricator.wikimedia.org/T212526) (owner: 10CRusnov) [15:12:25] I'm going to deploy a config change [15:13:19] (03CR) 10Bstorm: [C: 03+1] "I like the idea. :)" [puppet] - 10https://gerrit.wikimedia.org/r/497762 (https://phabricator.wikimedia.org/T217086) (owner: 10GTirloni) [15:14:32] (03PS1) 10Gehel: Always log output of apt-get install. [cookbooks] - 10https://gerrit.wikimedia.org/r/497784 [15:15:06] (03CR) 10Brian Wolff: [C: 03+2] Adjust wikitech account settings. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/497782 (https://phabricator.wikimedia.org/T218654) (owner: 10Brian Wolff) [15:15:59] (03PS9) 10Jbond: Move qualified parameters to there correct location [puppet] - 10https://gerrit.wikimedia.org/r/497767 [15:16:05] (03Merged) 10jenkins-bot: Adjust wikitech account settings. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497782 (https://phabricator.wikimedia.org/T218654) (owner: 10Brian Wolff) [15:17:14] (03CR) 10Jbond: "Thanks Ricardo i hadn't noticed you got auto added so sorry for the noise. should be nearly done now" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/497767 (owner: 10Jbond) [15:17:16] 10Operations, 10ops-eqiad, 10DC-Ops: labsdb1009.mgmt down - https://phabricator.wikimedia.org/T218789 (10ayounsi) p:05Triage→03High [15:17:42] 10Operations, 10ops-eqiad, 10Cognate, 10Growth-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10ayounsi) Thanks, opened T218789 [15:17:52] (03PS10) 10Jbond: Move qualified parameters to there correct location [puppet] - 10https://gerrit.wikimedia.org/r/497767 [15:18:11] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/497767 (owner: 10Jbond) [15:18:13] 10Operations, 10MobileFrontend, 10TechCom-RFC, 10Traffic, 10Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (10Rillke) [15:18:26] (03CR) 10Gehel: [C: 03+2] Always log output of apt-get install. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/497784 (owner: 10Gehel) [15:18:28] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1098:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497785 [15:20:30] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.e6-upgrade [15:20:30] !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.e6-upgrade (exit_code=0) [15:20:30] gehel@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [15:20:31] gehel@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [15:21:47] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services: labsdb1009.mgmt down - https://phabricator.wikimedia.org/T218789 (10Marostegui) [15:22:09] !log bawolff@deploy1001 Synchronized wmf-config/wikitech.php: Adjust account stuff at wikitech 4adc89bce4 (duration: 00m 48s) [15:22:09] bawolff@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [15:23:43] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.e6-upgrade [15:23:43] !log gehel@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.e6-upgrade (exit_code=99) [15:23:43] gehel@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [15:23:43] gehel@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [15:24:12] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.e6-upgrade [15:24:12] gehel@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [15:24:30] !log gehel@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.e6-upgrade (exit_code=99) [15:24:30] gehel@cumin1001: Failed to log message to wiki. Somebody should check the error logs. 
[15:26:03] (03PS11) 10Jcrespo: mariadb-snapshots: Better error and logging handling [puppet] - 10https://gerrit.wikimedia.org/r/496746 (https://phabricator.wikimedia.org/T210292) [15:26:05] (03PS3) 10Jcrespo: mariadb-backups: Make some dump options explicit [puppet] - 10https://gerrit.wikimedia.org/r/497708 (https://phabricator.wikimedia.org/T206203) [15:26:14] (03PS31) 10Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) [15:26:47] (03PS1) 10Dzahn: openldap::management: install python-mysqldb package [puppet] - 10https://gerrit.wikimedia.org/r/497788 [15:32:19] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1098:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497785 (owner: 10Marostegui) [15:33:49] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.e6-upgrade [15:33:49] gehel@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [15:33:58] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1098:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497785 (owner: 10Marostegui) [15:34:47] (03PS2) 10Smalyshev: Moving categories dump storage to dumps/ [puppet] - 10https://gerrit.wikimedia.org/r/497662 (https://phabricator.wikimedia.org/T218457) [15:35:00] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1098:3317 (duration: 00m 50s) [15:35:01] marostegui@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [15:35:04] (03CR) 10Volans: [C: 03+1] "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/398120 (https://phabricator.wikimedia.org/T182819) (owner: 10Herron) [15:35:18] !log gehel@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.e6-upgrade (exit_code=99) [15:35:18] gehel@cumin1001: Failed to log message to wiki. Somebody should check the error logs. 
[15:35:33] (03CR) 10Smalyshev: "@Gehel you can deploy anythime, this should not break anything." [puppet] - 10https://gerrit.wikimedia.org/r/497662 (https://phabricator.wikimedia.org/T218457) (owner: 10Smalyshev) [15:35:49] (03CR) 10Gehel: [C: 03+2] Moving categories dump storage to dumps/ [puppet] - 10https://gerrit.wikimedia.org/r/497662 (https://phabricator.wikimedia.org/T218457) (owner: 10Smalyshev) [15:39:48] (03PS1) 10Gehel: apt-get needs to be non interactive. [cookbooks] - 10https://gerrit.wikimedia.org/r/497791 [15:40:37] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/497791 (owner: 10Gehel) [15:41:06] (03CR) 10Mathew.onipe: "PCC output is Ok: https://puppet-compiler.wmflabs.org/compiler1001/15232/" [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [15:44:45] (03PS1) 10Marostegui: wikireplica_dns.yaml: Depool dbproxy1011 [puppet] - 10https://gerrit.wikimedia.org/r/497793 [15:45:19] (03CR) 10Marostegui: "I am aiming to push this tomorrow morning" [puppet] - 10https://gerrit.wikimedia.org/r/497793 (owner: 10Marostegui) [15:45:46] (03CR) 10Jcrespo: [C: 03+1] wikireplica_dns.yaml: Depool dbproxy1011 [puppet] - 10https://gerrit.wikimedia.org/r/497793 (owner: 10Marostegui) [15:45:57] (03CR) 10Jbond: "small nitpick. Also for my benefit and not proposing for this change. Is there a reason why this writes an ldif and has the user run it w" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/497758 (owner: 10Muehlenhoff) [15:46:34] (03CR) 10Gehel: [C: 03+2] apt-get needs to be non interactive. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/497791 (owner: 10Gehel) [15:50:44] (03CR) 10Volans: [C: 03+1] "LGTM, very minor nitpicks inline" (033 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/497496 (owner: 10Jbond) [15:51:28] (03CR) 10Ottomata: [C: 03+1] yarn: allow the configuration of maximum app ids retained in HDFS/Zookeeper [puppet/cdh] - 10https://gerrit.wikimedia.org/r/497702 (https://phabricator.wikimedia.org/T218758) (owner: 10Elukey) [15:52:10] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.e6-upgrade [15:52:11] gehel@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [15:55:50] !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.e6-upgrade (exit_code=0) [15:55:51] gehel@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [15:59:07] (03PS8) 10Jbond: debdeploy: add config to filter out services [puppet] - 10https://gerrit.wikimedia.org/r/497481 [16:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Morning SWAT (Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190320T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:06:51] (03PS3) 10Jbond: Add option to filter out services which don't actually need a restart [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/497496 [16:08:29] (03CR) 10Jbond: Add option to filter out services which don't actually need a restart (033 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/497496 (owner: 10Jbond) [16:25:40] !log mkdir /srv/dumps/xmldatadumps/public/other/rook for T218587 (fyi apergos) [16:25:41] chasemp: Failed to log message to wiki. Somebody should check the error logs. [16:25:52] eh? [16:26:04] what is it, where is it being made? 
[16:27:07] there's a bug, wikitech strict ldap mode [16:27:16] em, a task in phab already [16:32:38] (CR) Alex Monk: [C: +1] acme_chief: Update acme_chief::cert resource to fetch several cert versions [puppet] - https://gerrit.wikimedia.org/r/496148 (https://phabricator.wikimedia.org/T207295) (owner: Vgutierrez) [16:44:47] !log disable lldp on asw2-a-eqiad:ge-8/0/10 [16:44:47] XioNoX: Failed to log message to wiki. Somebody should check the error logs. [17:13:44] PROBLEM - Long running screen/tmux on notebook1003 is CRITICAL: CRIT: Long running SCREEN process. (user: fsalutari PID: 15499, 1729431s 1728000s). [17:17:14] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [17:25:51] (CR) Jbond: "> Patch Set 5: Code-Review+1" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/497767 (owner: Jbond) [17:28:06] (CR) Arturo Borrero Gonzalez: [C: +1] wikireplica_dns.yaml: Depool dbproxy1011 [puppet] - https://gerrit.wikimedia.org/r/497793 (owner: Marostegui) [17:33:10] Operations, monitoring, Patch-For-Review, Wikimedia-Incident: INCIDENT: k8s@codfw prometheus queries disabled -- very slow to execute some queries - https://phabricator.wikimedia.org/T217715 (fgiunchedi) Open→Resolved Resolving since the mitigations in place have been working as expected...
[17:34:01] (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/497767 (owner: 10Jbond) [17:38:08] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: move mediawiki syslogs to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/497570 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron) [17:40:54] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] backups: Make rentention policy configurable [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/496714 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [17:45:19] (03CR) 10Filippo Giunchedi: [C: 03+1] eqiad-prod: 0 weight to ms-be1043/sdk1 [software/swift-ring] - 10https://gerrit.wikimedia.org/r/497557 (https://phabricator.wikimedia.org/T218544) (owner: 10CDanis) [17:45:46] (03PS4) 10Jforrester: TestCommons: Enable federation of Wikidata items and properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485082 (https://phabricator.wikimedia.org/T214075) [17:46:47] 10Operations, 10ops-eqiad, 10Operations-Software-Development, 10monitoring, 10Patch-For-Review: ms-be1043 sdk failed - https://phabricator.wikimedia.org/T218544 (10fgiunchedi) >>! In T218544#5036387, @CDanis wrote: > @fgiunchedi we should set this device to 0 weight in the rings, yes? Happy to do the ch... 
[17:57:51] (03CR) 10Cwhite: [C: 03+1] logstash: move mediawiki syslogs to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/497570 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron) [17:58:14] (03PS2) 10Herron: logstash: move mediawiki syslogs to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/497570 (https://phabricator.wikimedia.org/T213899) [17:59:25] (03CR) 10Herron: [C: 03+2] logstash: move mediawiki syslogs to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/497570 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190320T1800) [18:00:09] (03CR) 10Filippo Giunchedi: "nits inline but overall LGTM" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/496776 (https://phabricator.wikimedia.org/T215277) (owner: 10Jbond) [18:15:47] (03PS7) 10Jbond: Create and mtail parser for ulogd and install it on the syslog server [puppet] - 10https://gerrit.wikimedia.org/r/496776 (https://phabricator.wikimedia.org/T215277) [18:17:07] (03CR) 10Jbond: "Thanks Filippo, nits fixed" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/496776 (https://phabricator.wikimedia.org/T215277) (owner: 10Jbond) [18:20:02] (03CR) 10Jforrester: "Addshore says it looks good to him." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485082 (https://phabricator.wikimedia.org/T214075) (owner: 10Jforrester) [18:26:06] !log zfilipin@deploy1001 Started scap: testwiki to php-1.33.0-wmf.22 and rebuild l10n cache [18:26:06] zfilipin@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [18:37:52] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [18:37:52] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
[18:37:54] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [18:37:54] !log otto@deploy1001 scap-helm eventgate-analytics finished [18:37:54] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [18:37:54] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [18:39:34] !log otto@deploy1001 scap-helm eventgate-analytics upgrade production -f eventgate-analytics-codfw-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: codfw] [18:39:34] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [18:39:37] !log otto@deploy1001 scap-helm eventgate-analytics cluster codfw completed [18:39:37] !log otto@deploy1001 scap-helm eventgate-analytics finished [18:39:37] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [18:39:38] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [18:39:38] !log otto@deploy1001 scap-helm eventgate-analytics upgrade production -f eventgate-analytics-eqiad-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: eqiad] [18:39:38] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [18:39:40] !log otto@deploy1001 scap-helm eventgate-analytics cluster eqiad completed [18:39:40] !log otto@deploy1001 scap-helm eventgate-analytics finished [18:39:40] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [18:39:41] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
[18:40:26] Operations, Parsoid-PHP: Install PHP7 on scandium - https://phabricator.wikimedia.org/T213493 (Neil_P._Quinn_WMF) [18:46:43] Operations, Wikimedia-Logstash, Patch-For-Review, User-herron: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs - https://phabricator.wikimedia.org/T213899 (herron) [18:49:09] !log hitting eventgate-analytics in eqiad with ab [18:49:09] ottomata: Failed to log message to wiki. Somebody should check the error logs. [18:49:13] ...o right [18:50:12] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [18:50:36] !log restarting pdfrender on scb1003 [18:50:37] jijiki: Failed to log message to wiki. Somebody should check the error logs. [18:50:42] yeah whatever [18:50:46] haha [18:51:17] :p [18:52:36] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time https://phabricator.wikimedia.org/T174916 [18:56:31] (PS1) Effie Mouzeli: admin: add perf-roots to admin::groups for mwmaint* [puppet] - https://gerrit.wikimedia.org/r/497840 (https://phabricator.wikimedia.org/T217813) [18:56:32] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 52.20, 32.58, 22.88 [18:57:00] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [18:57:34] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 66.90, 38.09, 24.98 [19:00:04] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190320T1900) [19:00:10] (CR) Volans: [C: +1] "LGTM, but please coordinate also with someone in WMCS when merging it to keep an eye that nothing major breaks there too."
[puppet] - 10https://gerrit.wikimedia.org/r/497767 (owner: 10Jbond) [19:00:40] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 58.37, 32.35, 21.60 [19:01:04] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 51.95, 31.79, 21.54 [19:03:28] (03PS12) 10Jcrespo: mariadb-snapshots: Better error and logging handling [puppet] - 10https://gerrit.wikimedia.org/r/496746 (https://phabricator.wikimedia.org/T210292) [19:03:30] (03PS4) 10Jcrespo: mariadb-backups: Make some dump options explicit [puppet] - 10https://gerrit.wikimedia.org/r/497708 (https://phabricator.wikimedia.org/T206203) [19:03:32] (03PS1) 10Jcrespo: mariadb-backups: Fix missing --retention param for backup_mariadb.py [puppet] - 10https://gerrit.wikimedia.org/r/497843 (https://phabricator.wikimedia.org/T210292) [19:03:38] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 15.04, 24.69, 20.46 [19:04:00] (03Abandoned) 10Jcrespo: mariadb-backups: Fix missing --retention param for backup_mariadb.py [puppet] - 10https://gerrit.wikimedia.org/r/497843 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [19:04:35] !log zfilipin@deploy1001 Finished scap: testwiki to php-1.33.0-wmf.22 and rebuild l10n cache (duration: 38m 29s) [19:04:36] zfilipin@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
[19:05:46] RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 15.67, 24.29, 21.68 [19:06:16] (03CR) 10Jcrespo: [C: 03+2] mariadb-snapshots: Better error and logging handling [puppet] - 10https://gerrit.wikimedia.org/r/496746 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [19:06:34] (03PS1) 10Ottomata: eventgate-analytics Set rdkafka log.connection.close: false [deployment-charts] - 10https://gerrit.wikimedia.org/r/497844 [19:06:57] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics Set rdkafka log.connection.close: false [deployment-charts] - 10https://gerrit.wikimedia.org/r/497844 (owner: 10Ottomata) [19:07:13] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Make some dump options explicit [puppet] - 10https://gerrit.wikimedia.org/r/497708 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [19:10:20] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [19:10:20] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [19:10:22] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 12.93, 18.43, 23.23 [19:12:44] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:13:10] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 13.41, 17.88, 23.26 [19:13:26] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [19:13:26] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
[19:13:33] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [19:13:33] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [19:13:35] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [19:13:35] !log otto@deploy1001 scap-helm eventgate-analytics finished [19:13:35] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [19:13:36] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [19:20:18] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:21:50] (03PS1) 10Eevans: prometheus: collect session storaage Cassandra metrics [puppet] - 10https://gerrit.wikimedia.org/r/497848 (https://phabricator.wikimedia.org/T209108) [19:24:06] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:24:10] (PS2) Eevans: prometheus: collect session storaage Cassandra metrics [puppet] - https://gerrit.wikimedia.org/r/497848 (https://phabricator.wikimedia.org/T209108) [19:26:21] Operations, serviceops, Core Platform Team (Session Management Service (CDP2)), Core Platform Team Kanban (Doing), and 4 others: Session storage Cassandra cluster configuration - https://phabricator.wikimedia.org/T215883 (Eevans) [19:26:48] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:29:03] (PS3) Alex Monk: [WIP] acme-chief: Add script for Designate integration [puppet] - https://gerrit.wikimedia.org/r/497670 (https://phabricator.wikimedia.org/T206922) [19:30:16] (CR) jerkins-bot: [V: -1] [WIP] acme-chief: Add script for Designate integration [puppet] - https://gerrit.wikimedia.org/r/497670 (https://phabricator.wikimedia.org/T206922) (owner: Alex Monk) [19:30:22] (CR) Joal: [C: +1] yarn: allow the configuration of maximum app ids retained in HDFS/Zookeeper [puppet/cdh] - https://gerrit.wikimedia.org/r/497702 (https://phabricator.wikimedia.org/T218758) (owner: Elukey) [19:31:08] if there are no complaints, I'll try to deploy wmf.22 to group 0 [19:34:17] (CR) Zfilipin: [C: +2] Group0 to 1.33.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/497765 (owner: Zfilipin) [19:34:32] (PS4) Alex Monk: [WIP] acme-chief: Add script for Designate integration [puppet] - https://gerrit.wikimedia.org/r/497670 (https://phabricator.wikimedia.org/T206922) [19:35:19] (Merged) jenkins-bot: Group0 to 1.33.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/497765 (owner: Zfilipin) [19:35:21] (PS1) Jcrespo: mariadb-snapshots: Fix snapshot statistics db credentials [puppet] - 
10https://gerrit.wikimedia.org/r/497853 (https://phabricator.wikimedia.org/T210292) [19:35:38] (03CR) 10jerkins-bot: [V: 04-1] [WIP] acme-chief: Add script for Designate integration [puppet] - 10https://gerrit.wikimedia.org/r/497670 (https://phabricator.wikimedia.org/T206922) (owner: 10Alex Monk) [19:37:13] (03PS5) 10Alex Monk: [WIP] acme-chief: Add script for Designate integration [puppet] - 10https://gerrit.wikimedia.org/r/497670 (https://phabricator.wikimedia.org/T206922) [19:38:18] (03CR) 10jerkins-bot: [V: 04-1] [WIP] acme-chief: Add script for Designate integration [puppet] - 10https://gerrit.wikimedia.org/r/497670 (https://phabricator.wikimedia.org/T206922) (owner: 10Alex Monk) [19:38:26] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.33.0-wmf.22 [19:38:26] zfilipin@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [19:40:20] (03PS6) 10Alex Monk: [WIP] acme-chief: Add script for Designate integration [puppet] - 10https://gerrit.wikimedia.org/r/497670 (https://phabricator.wikimedia.org/T206922) [19:45:21] (03CR) 10Jcrespo: [C: 03+2] mariadb-snapshots: Fix snapshot statistics db credentials [puppet] - 10https://gerrit.wikimedia.org/r/497853 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / …. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190320T2000). [20:07:35] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [20:07:36] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
[20:07:37] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [20:07:38] !log otto@deploy1001 scap-helm eventgate-analytics finished [20:07:38] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [20:07:38] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [20:12:15] (03CR) 10CDanis: [V: 03+2 C: 03+2] eqiad-prod: 0 weight to ms-be1043/sdk1 [software/swift-ring] - 10https://gerrit.wikimedia.org/r/497557 (https://phabricator.wikimedia.org/T218544) (owner: 10CDanis) [20:13:02] !log otto@deploy1001 scap-helm eventgate-analytics upgrade production -f eventgate-analytics-eqiad-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: eqiad] [20:13:03] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [20:13:04] !log otto@deploy1001 scap-helm eventgate-analytics cluster eqiad completed [20:13:04] !log otto@deploy1001 scap-helm eventgate-analytics finished [20:13:05] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [20:13:05] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [20:13:08] !log otto@deploy1001 scap-helm eventgate-analytics upgrade production -f eventgate-analytics-codfw-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: codfw] [20:13:08] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [20:13:11] !log otto@deploy1001 scap-helm eventgate-analytics cluster codfw completed [20:13:11] !log otto@deploy1001 scap-helm eventgate-analytics finished [20:13:11] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [20:13:11] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
[20:14:02] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:14:18] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [20:15:20] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:15:32] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [20:19:25] wmf.22 at group 0 seems ok, deploying to group 1 [20:20:12] (03PS1) 10Zfilipin: group1 wikis to 1.33.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497859 [20:20:13] (03CR) 10Zfilipin: [C: 03+2] group1 wikis to 1.33.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497859 (owner: 10Zfilipin) [20:21:29] (03Merged) 10jenkins-bot: group1 wikis to 1.33.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497859 (owner: 10Zfilipin) [20:23:12] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.22 [20:23:14] zfilipin@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [20:24:59] !log zfilipin@deploy1001 Synchronized php: group1 wikis to 1.33.0-wmf.22 (duration: 01m 46s) [20:25:00] zfilipin@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [20:31:04] zeljkof: Can I do a config deployment, or do you want to keep things clear for a bit longer? 
[20:32:05] James_F: I would prefer if you could do it during US swat (read: while I'm sleeping) :) [20:32:10] 10Operations, 10Analytics, 10Analytics-Kanban, 10EventBus, and 3 others: eventgate-analytics k8s pods occasionally can't produce to kafka - https://phabricator.wikimedia.org/T218268 (10Ottomata) a:03Ottomata [20:32:28] if it's urgent, go ahead, if it's not please wait, I've just deployed to group 1, still looking at logs [20:33:00] zeljkof: I'll wait until after you're done. :-) [20:33:27] thanks :) [20:38:48] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [20:40:04] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [20:44:29] (03PS1) 10BryanDavis: wikitech: Lock LDAP accounts when users are blocked [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497866 (https://phabricator.wikimedia.org/T168692) [20:49:04] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [20:56:46] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [21:00:27] (03CR) 10Mobrovac: [C: 03+1] prometheus: collect session storaage Cassandra metrics 
[puppet] - 10https://gerrit.wikimedia.org/r/497848 (https://phabricator.wikimedia.org/T209108) (owner: 10Eevans) [21:00:50] !log apply icmp redirect on cr1-codfw:xe-5/0/2 (to cr4-ulsfo) for test IP 208.80.154.225 - T190090 [21:01:05] XioNoX: Failed to log message to wiki. Somebody should check the error logs. [21:01:06] T190090: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 [21:08:45] (03PS1) 10Herron: wip: apache2: add generalized global IP address ban [puppet] - 10https://gerrit.wikimedia.org/r/497878 (https://phabricator.wikimedia.org/T218784) [21:10:01] (03CR) 10jerkins-bot: [V: 04-1] wip: apache2: add generalized global IP address ban [puppet] - 10https://gerrit.wikimedia.org/r/497878 (https://phabricator.wikimedia.org/T218784) (owner: 10Herron) [21:11:21] (03PS2) 10Herron: wip: apache2: add generalized global IP address ban [puppet] - 10https://gerrit.wikimedia.org/r/497878 (https://phabricator.wikimedia.org/T218784) [21:16:02] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 (10ayounsi) Typo above, test IP is 208.80.15**3**.225. Successfully tested on 1 link with: `cr4-ulsfo> ping source 129.250.204.6 208.80.153.225 ` Pushing the change to the ot... [21:29:47] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 (10ayounsi) `name=cr2-codfw,lang=diff [edit interfaces xe-5/0/0] - description "Core: cr2-eqdfw:xe-0/1/4 (CyrusOne wikimedia:ix2.dfw4_to_ix2.dfw5.245.0009) {#11403} [10Gbps... [21:34:23] !log apply transit-in4 term offload-ping4 with test IP to cr2-codfw [21:34:23] XioNoX: Failed to log message to wiki. Somebody should check the error logs. [21:37:44] !log apply transit-in4 term offload-ping4 with test IP to cr1/2-codfw - T190090 [21:37:46] XioNoX: Failed to log message to wiki. Somebody should check the error logs. 
[21:37:47] T190090: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 [21:41:59] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [21:48:08] (03PS3) 10Herron: wip: apache2: add generalized global IP address ban [puppet] - 10https://gerrit.wikimedia.org/r/497878 (https://phabricator.wikimedia.org/T218784) [21:48:28] (03PS1) 10Alex Monk: [WIP] Allow acme-chief to provide unified cert [puppet] - 10https://gerrit.wikimedia.org/r/497929 [21:49:41] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Allow acme-chief to provide unified cert [puppet] - 10https://gerrit.wikimedia.org/r/497929 (owner: 10Alex Monk) [21:51:42] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 (10ayounsi) Next step is to apply the following to replace the test IP with codfw text-lb IP. `lang=diff [edit firewall family inet filter transport-in4 term no-offload-ping4... 
[21:52:20] (03PS1) 10Cwhite: httpd: featurize mod-security for use with httpd [puppet] - 10https://gerrit.wikimedia.org/r/497930 (https://phabricator.wikimedia.org/T218784) [21:56:16] (03PS2) 10Alex Monk: [WIP] Allow acme-chief to provide unified cert [puppet] - 10https://gerrit.wikimedia.org/r/497929 [21:57:24] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [22:06:06] (03PS3) 10Alex Monk: [WIP] Allow acme-chief to provide unified cert [puppet] - 10https://gerrit.wikimedia.org/r/497929 [22:06:24] (03PS4) 10Herron: wip: apache2: add generalized global IP address ban [puppet] - 10https://gerrit.wikimedia.org/r/497878 (https://phabricator.wikimedia.org/T218784) [22:17:43] I'm going to deploy a config change for TestCommons. [22:18:01] (03CR) 10Jforrester: [C: 03+2] TestCommons: Enable federation of Wikidata items and properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485082 (https://phabricator.wikimedia.org/T214075) (owner: 10Jforrester) [22:19:27] (03Merged) 10jenkins-bot: TestCommons: Enable federation of Wikidata items and properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485082 (https://phabricator.wikimedia.org/T214075) (owner: 10Jforrester) [22:21:28] (03CR) 10Alex Monk: acme_chief: Add security::access::config on passive host if realm == labs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/497430 (owner: 10Alex Monk) [22:26:31] (03PS2) 10Cwhite: httpd: featurize mod-security for use with httpd [puppet] - 10https://gerrit.wikimedia.org/r/497930 (https://phabricator.wikimedia.org/T218784) [22:29:11] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T214075 Enable federation of Wikidata items and properties on Test Commons (duration: 00m 57s) [22:29:13] jforrester@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
[22:29:14] T214075: Enable federated access to entities and properties from Wikidata to Commons - https://phabricator.wikimedia.org/T214075 [22:31:55] (03PS5) 10Herron: apache2: add generalized global IP address ban [puppet] - 10https://gerrit.wikimedia.org/r/497878 (https://phabricator.wikimedia.org/T218784) [22:35:25] (03PS3) 10Cwhite: httpd: featurize mod-security for use with httpd [puppet] - 10https://gerrit.wikimedia.org/r/497930 (https://phabricator.wikimedia.org/T218784) [22:41:24] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1023 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/d/000000523/kafka-graphite?refresh=5m&panelId=29&fullscreen&orgId=1 [22:45:35] (03PS1) 10Herron: add fake secret for waf ipaddress banlist [labs/private] - 10https://gerrit.wikimedia.org/r/497942 [22:47:11] (03PS4) 10Cwhite: httpd: featurize mod-security for use with httpd [puppet] - 10https://gerrit.wikimedia.org/r/497930 (https://phabricator.wikimedia.org/T218784) [22:48:13] (03CR) 10Herron: [V: 03+2 C: 03+2] add fake secret for waf ipaddress banlist [labs/private] - 10https://gerrit.wikimedia.org/r/497942 (owner: 10Herron) [22:55:23] (03PS1) 10Herron: fix typo in global_ipaddress_banlist filename [labs/private] - 10https://gerrit.wikimedia.org/r/497944 [22:55:35] (03CR) 10Herron: [V: 03+2 C: 03+2] fix typo in global_ipaddress_banlist filename [labs/private] - 10https://gerrit.wikimedia.org/r/497944 (owner: 10Herron) [22:56:17] Wait, so labs/private is public? 
[22:56:44] That doesnt seem rigjt [22:56:46] Right* [22:57:15] I mean its labs, so maybe that's ok and its just an analogy to the private repo for prod [22:57:30] yeah, it’s confusing naming wise but is expected to be public [22:57:58] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic, 10HTTPS: https://sv.wikipedia.beta.wmflabs.org/ has invalid certificate - https://phabricator.wikimedia.org/T202564 (10Krenair) works now with some puppet cherry-picks [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Evening SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190320T2300). [23:00:04] Smalyshev: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:16] here [23:00:25] Guess I can do it [23:00:57] the train is on group 1, so it seems to be ok to merge [23:01:19] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work): Migrate mjolnir to stdout/syslog/cee logging output - https://phabricator.wikimedia.org/T218833 (10EBernhardson) [23:02:44] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [23:02:57] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work): Migrate mjolnir to stdout/syslog/cee logging output - https://phabricator.wikimedia.org/T218833 (10EBernhardson) [23:03:11] (03PS5) 10MaxSem: Enable WikibaseCirrusSearch on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490648 (https://phabricator.wikimedia.org/T215684) (owner: 10Jforrester) [23:03:17] (03CR) 10MaxSem: [C: 03+2] Enable WikibaseCirrusSearch on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490648 (https://phabricator.wikimedia.org/T215684) 
(owner: 10Jforrester) [23:03:43] (03PS1) 10Cwhite: httpd: generate require not ip rules for RequireAll directive [puppet] - 10https://gerrit.wikimedia.org/r/497946 (https://phabricator.wikimedia.org/T218784) [23:04:17] (03Merged) 10jenkins-bot: Enable WikibaseCirrusSearch on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490648 (https://phabricator.wikimedia.org/T215684) (owner: 10Jforrester) [23:05:17] SMalyshev: pulled on mwdebug1002 [23:05:44] checking [23:06:17] (03CR) 10Herron: "PCC looks good https://puppet-compiler.wmflabs.org/compiler1002/15237/people1001.eqiad.wmnet/ and have tested the ruleset as well in labs." [puppet] - 10https://gerrit.wikimedia.org/r/497878 (https://phabricator.wikimedia.org/T218784) (owner: 10Herron) [23:06:36] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 7.790 second response time https://phabricator.wikimedia.org/T174916 [23:08:10] SMalyshev: if we are enabling on wikidata, would we enable on commons too? [23:08:32] SMalyshev: Would feel better about merging the WikibaseMediaInfo patch to use WBCS if it was on commons [23:08:33] Yeah, there's a second patch for it [23:08:36] ahh, ok :) [23:08:39] seems to be working ok to me [23:09:06] ebernhardson: yeah I did it in two patches just in case commons has something that breaks something... [23:09:20] sounds reasonable [23:09:25] MaxSem: I think you can deploy the first one everywhere [23:10:32] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [23:10:55] !log maxsem@deploy1001 Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/490648/ (duration: 00m 56s) [23:10:55] maxsem@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [23:11:25] SMalyshev: ^ [23:11:50] is that something wrong? 
[23:12:23] Nah, logging was broken for days [23:12:50] ok then, cool [23:12:55] let's do the second one :) [23:13:11] (03PS11) 10Jbond: Move qualified parameters to there correct location [puppet] - 10https://gerrit.wikimedia.org/r/497767 [23:13:30] I've got a fix to a UBN task, BTW, so once SWAT is done I can sling that out. [23:14:11] SMalyshev: rebase conflict [23:14:23] James_F: how much time do you need? [23:14:31] hmm that's weird. let me check [23:16:12] (03CR) 10Jbond: "> Patch Set 10: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/497767 (owner: 10Jbond) [23:18:07] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T212010 (10RobH) Ok, in reviewing the ordering task of T137132, it isn't 100% clear at first review what was ordered (packing slip only has the second page attached to task, and the task has a lot of different specs on it.)... [23:18:19] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T212010 (10RobH) a:05Cmjohnson→03RobH [23:18:44] ! [remote rejected] HEAD -> refs/publish/master/wbsc-enable (cannot add patch set to 490649.) [23:18:46] hmm [23:18:51] what the what? [23:19:10] (03PS1) 10Smalyshev: Enable WikibaseCirrusSearch on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497948 (https://phabricator.wikimedia.org/T215684) [23:19:11] ok I'll make the new one [23:19:13] oh [23:19:13] Yep, git loves you too :} [23:19:22] SMalyshev yeh, that's known [23:19:43] paladox: security measures? [23:19:49] yup, i guess. [23:20:06] MaxSem: use https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/497948 instead [23:20:09] the right was yanked from all users. 
[23:20:16] (03CR) 10MaxSem: [C: 03+2] Enable WikibaseCirrusSearch on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497948 (https://phabricator.wikimedia.org/T215684) (owner: 10Smalyshev) [23:21:07] (03CR) 10Herron: [C: 03+2] apache2: add generalized global IP address ban [puppet] - 10https://gerrit.wikimedia.org/r/497878 (https://phabricator.wikimedia.org/T218784) (owner: 10Herron) [23:21:12] (03Merged) 10jenkins-bot: Enable WikibaseCirrusSearch on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497948 (https://phabricator.wikimedia.org/T215684) (owner: 10Smalyshev) [23:22:00] SMalyshev: pulled on mwdebug1002 [23:22:07] checking [23:22:20] paladox, surely SMalyshev would be in a group that permits that? [23:22:26] Nope [23:22:33] unless he is an admin [23:22:55] MaxSem: Still waiting for the fix to land in master. [23:22:56] though i think we can try and re-add it at some point. [23:23:00] remind me what the restriction is against [23:23:05] adding patch sets to others' changes? [23:23:16] Krenair yup [23:23:29] surely that can be opened up far beyond admins [23:23:44] paladox: well, I'd certainly want the right to change patches in some repos... [23:23:44] It can, but the right was yanked for obvious reasons :) [23:23:49] from all users sure [23:23:55] rebasing/touching up somebody else's patch is common [23:23:57] but there's plenty of groups that can be trusted with that [23:24:17] yup [23:26:16] MaxSem: nothing seems to be broken... [23:26:39] Ok, breaking for realz then [23:26:47] though inlabel: for some reason doesn't work, but it may not be enabled... not sure. [23:27:08] ebernhardson: do you know if inlabel: is supposed to work on commons or I need some additional things? [23:28:23] !log maxsem@deploy1001 Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/497948/ (duration: 00m 56s) [23:28:24] maxsem@deploy1001: Failed to log message to wiki. 
Somebody should check the error logs. [23:28:32] SMalyshev: ^ [23:28:39] MaxSem: thanks! [23:29:23] James_F: all yours [23:30:00] Thanks. [23:30:10] Still waiting for the master patch to merge, however. :-( [23:30:56] (03PS4) 10Alex Monk: [WIP] Allow acme-chief to provide unified cert [puppet] - 10https://gerrit.wikimedia.org/r/497929 [23:31:35] cscott: can https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/497941/ be deployed right now? [23:31:55] MaxSem: Yeah, I'll do those afterwards. [23:32:03] Weee [23:32:26] But https://gerrit.wikimedia.org/r/c/mediawiki/core/+/497949 is actually UBN. No block reasons working on all of group1 right now. [23:32:38] So you can't justify blocking someone on meta. :-( [23:32:48] Also other functionality issues. [23:33:06] Pffff, generations upon generations of our ancestors just typed the reason [23:33:24] Sure, but generations of our sysops are newbies from my POV. [23:33:33] (03CR) 10Smalyshev: [C: 03+1] "Merged as https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/497948" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490649 (https://phabricator.wikimedia.org/T215684) (owner: 10Jforrester) [23:33:40] And JS fatals in core admin tools are not ideal. [23:33:43] James_F: you can abandon https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/490649 now - since we made another one instead [23:33:46] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Allow acme-chief to provide unified cert [puppet] - 10https://gerrit.wikimedia.org/r/497929 (owner: 10Alex Monk) [23:34:09] Oh, sure. Why not just rebase? [23:34:24] (03Abandoned) 10Jforrester: Enable WikibaseCirrusSearch on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490649 (https://phabricator.wikimedia.org/T215684) (owner: 10Jforrester) [23:42:27] James_F: because it doesn't let me rebase :) [23:42:39] that's what I was just asking about [23:43:31] SMalyshev: Did you try via the UI? 
[23:43:45] yeah UI has no rebase button at all [23:43:52] (had of course) [23:43:54] Oh, how odd. [23:44:10] (03PS2) 10Jforrester: scap: add logging to clean > prune-git-branches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497781 (owner: 10Hashar) [23:44:17] That's done on purpose i think. [23:44:20] E.g. ^^ WFM. [23:44:25] it used to be that I could rebase it manually by submitting new patch, but not anymore [23:44:27] James_F are you an admin? [23:44:29] SMalyshev: Can you file a Phab task? [23:44:33] paladox: No. [23:44:37] (03PS1) 10Herron: change apache modsec ip ban list action to drop [puppet] - 10https://gerrit.wikimedia.org/r/497954 (https://phabricator.wikimedia.org/T218784) [23:44:40] hmm [23:44:49] James_F: sure thing [23:44:56] "Project Owners" has the rebase right [23:46:16] Oh, right, SMalyshev you're not a member of https://gerrit.wikimedia.org/r/admin/groups/3fdcf8fd0d569e90a3e9b39788a29f2c50d33be9,members so you can't fiddle with other people's patches in that repo any more. [23:46:38] James_F: yeah I guess that's the reason [23:46:41] (wmf-deployment) [23:47:08] (03CR) 10Herron: [C: 03+2] change apache modsec ip ban list action to drop [puppet] - 10https://gerrit.wikimedia.org/r/497954 (https://phabricator.wikimedia.org/T218784) (owner: 10Herron) [23:47:12] (03PS1) 10Ejegg: Load FundraisingTranslateWorkflow after Translate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497956 (https://phabricator.wikimedia.org/T213943) [23:49:10] hi ops folks! Jforrester noted that a recent update to FundraisingTranslateWorkflow may be breaking beta sync [23:49:19] I think that config change might fix it ^^^ [23:49:24] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.22/resources/lib/ooui/oojs-ui-core.js: SWAT T218722 T218830 Bring forward UBN OOUI fix (duration: 00m 57s) [23:49:27] jforrester@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
[23:49:28] T218830: selectandother widget broken, particular on Special:GlobalBlock since 1.33.0-wmf.22 - https://phabricator.wikimedia.org/T218830 [23:49:28] T218722: DropdownInputWidget with MenuSectionOptionWidget triggers endless change events - https://phabricator.wikimedia.org/T218722 [23:49:31] ejegg: Thanks, will have a look. [23:49:46] :) [23:51:58] (03CR) 10Jforrester: [C: 03+2] Load FundraisingTranslateWorkflow after Translate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497956 (https://phabricator.wikimedia.org/T213943) (owner: 10Ejegg) [23:53:25] (03PS5) 10Alex Monk: [WIP] Allow acme-chief to provide unified cert [puppet] - 10https://gerrit.wikimedia.org/r/497929 [23:54:10] (03Merged) 10jenkins-bot: Load FundraisingTranslateWorkflow after Translate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497956 (https://phabricator.wikimedia.org/T213943) (owner: 10Ejegg) [23:54:21] (03PS2) 10Rush: httpd: subconfig for client handling [puppet] - 10https://gerrit.wikimedia.org/r/497946 (owner: 10Cwhite) [23:54:44] (03PS3) 10Rush: httpd: subconfig for client handling [puppet] - 10https://gerrit.wikimedia.org/r/497946 (owner: 10Cwhite)