[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181204T0000).
[00:00:05] No GERRIT patches in the queue for this window AFAICS.
[00:01:42] (PS1) Bstorm: sonofgridengine: stop the gridengine-master service on shadow nodes [puppet] - https://gerrit.wikimedia.org/r/477437 (https://phabricator.wikimedia.org/T200557)
[00:03:32] (CR) Bstorm: [C: 2] sonofgridengine: stop the gridengine-master service on shadow nodes [puppet] - https://gerrit.wikimedia.org/r/477437 (https://phabricator.wikimedia.org/T200557) (owner: Bstorm)
[00:04:05] Operations, ops-codfw, Patch-For-Review, Services (watching), User-fgiunchedi: rack/setup/install restbase201[3-8].codfw.wmnet - https://phabricator.wikimedia.org/T209615 (RobH) @fgiunchedi: Can you advise if these are fully online, and if so, can we start to proceed on the #decom of the olde...
[00:05:59] (CR) Dzahn: [C: -1] phabricator: add data types to all parameters (5 comments) [puppet] - https://gerrit.wikimedia.org/r/471325 (owner: Dzahn)
[00:06:24] (PS13) Dzahn: phabricator: add data types to all parameters [puppet] - https://gerrit.wikimedia.org/r/471325
[00:09:03] (PS3) Smalyshev: Enable SPARQL logging to a separate file [puppet] - https://gerrit.wikimedia.org/r/477429 (https://phabricator.wikimedia.org/T210044)
[00:18:49] PROBLEM - tilerator on maps1003 is CRITICAL: connect to address 10.64.32.117 and port 6534: Connection refused
[00:24:36] (PS1) Bstorm: sonofgridengine: remove weird accounting link [puppet] - https://gerrit.wikimedia.org/r/477446 (https://phabricator.wikimedia.org/T200557)
[00:28:17] (PS14) Paladox: phabricator: add data types to all parameters [puppet] - https://gerrit.wikimedia.org/r/471325 (owner: Dzahn)
[00:28:23] (CR) Paladox: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/471325 (owner: Dzahn)
[00:28:43] Operations, ops-eqiad, Analytics, Analytics-Kanban, User-Elukey: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (RobH)
[00:33:49] PROBLEM - tilerator on maps1002 is CRITICAL: connect to address 10.64.16.42 and port 6534: Connection refused
[00:34:19] PROBLEM - tilerator on maps1001 is CRITICAL: connect to address 10.64.0.79 and port 6534: Connection refused
[01:22:26] !log Reset password for user "Orangemike"
[01:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:22:39] !log Removing 2FA per request at https://phabricator.wikimedia.org/T210703
[01:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:35:53] (PS1) BPirkle: Create script to intentionally trigger fatal errors in MediaWiki [mediawiki-config] - https://gerrit.wikimedia.org/r/477450 (https://phabricator.wikimedia.org/T210567)
[01:37:01] (CR) jerkins-bot: [V: -1] Create script to intentionally trigger fatal errors in MediaWiki [mediawiki-config] - https://gerrit.wikimedia.org/r/477450 (https://phabricator.wikimedia.org/T210567) (owner: BPirkle)
[01:41:17] (PS2) BPirkle: Create script to intentionally trigger fatal errors in MediaWiki [mediawiki-config] - https://gerrit.wikimedia.org/r/477450 (https://phabricator.wikimedia.org/T210567)
[01:44:41] (CR) BPirkle: [C: -1] "Posting patchset for discussion. At a minimum, we still need to decide on how authentication will be handled." [mediawiki-config] - https://gerrit.wikimedia.org/r/477450 (https://phabricator.wikimedia.org/T210567) (owner: BPirkle)
[01:53:35] PROBLEM - puppet last run on labvirt1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apparmor.d/abstractions/ssl_certs]
[02:24:45] RECOVERY - puppet last run on labvirt1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[02:47:03] (CR) MR70: "I think this code will not work well in some scenarios, as if there was a user in the sysop and eliminator groups, he would not be able to" [mediawiki-config] - https://gerrit.wikimedia.org/r/476503 (https://phabricator.wikimedia.org/T210642) (owner: Huji)
[02:49:21] (CR) MR70: [C: -1] Dissallow eliminators to block certain groups on fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/476503 (https://phabricator.wikimedia.org/T210642) (owner: Huji)
[02:58:16] (CR) Huji: "In the particular case of fawiki, a user cannot be in both sysop and eliminator groups." [mediawiki-config] - https://gerrit.wikimedia.org/r/476503 (https://phabricator.wikimedia.org/T210642) (owner: Huji)
[03:31:20] (PS1) Dzahn: wikistats: fix xml dump cron jobs by specifying defaults-extra-file [puppet] - https://gerrit.wikimedia.org/r/477451 (https://phabricator.wikimedia.org/T200447)
[03:34:30] (CR) Ottomata: "Hm ok." [puppet] - https://gerrit.wikimedia.org/r/477419 (https://phabricator.wikimedia.org/T205437) (owner: Gergő Tisza)
[03:34:46] (PS3) Ottomata: EventLogging Logstash filter: move useful fields out of event [puppet] - https://gerrit.wikimedia.org/r/477419 (https://phabricator.wikimedia.org/T205437) (owner: Gergő Tisza)
[03:36:49] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 973.04 seconds
[03:45:03] (CR) Legoktm: Create script to intentionally trigger fatal errors in MediaWiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/477450 (https://phabricator.wikimedia.org/T210567) (owner: BPirkle)
[04:10:23] PROBLEM - High load average on labstore1007 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [24.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[04:11:29] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 204.97 seconds
[04:35:09] (PS1) Bstorm: dumps distribution: fail dumps web address back to labstore1006 [dns] - https://gerrit.wikimedia.org/r/477453
[04:36:06] (CR) Mathew.onipe: [C: 1] elasticsearch: create base data dir [puppet] - https://gerrit.wikimedia.org/r/477314 (owner: Gehel)
[04:38:00] (CR) Bstorm: [C: 2] dumps distribution: fail dumps web address back to labstore1006 [dns] - https://gerrit.wikimedia.org/r/477453 (owner: Bstorm)
[04:46:39] Operations, ops-eqiad, Analytics, Analytics-Kanban, User-Elukey: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (RobH)
[04:53:32] Operations, ops-eqiad, Analytics, Analytics-Kanban, User-Elukey: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (RobH) a:RobH>Cmjohnson Ok, this has had puppet run on all of the hosts. This is now ready for @cmjohnson to attach the othe...
[05:20:18] RECOVERY - High load average on labstore1007 is OK: OK: Less than 85.00% above the threshold [16.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[06:07:50] (PS1) Marostegui: db-eqiad.php: Depool db1110 [mediawiki-config] - https://gerrit.wikimedia.org/r/477456 (https://phabricator.wikimedia.org/T86338)
[06:09:59] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1110 [mediawiki-config] - https://gerrit.wikimedia.org/r/477456 (https://phabricator.wikimedia.org/T86338) (owner: Marostegui)
[06:11:07] (Merged) jenkins-bot: db-eqiad.php: Depool db1110 [mediawiki-config] - https://gerrit.wikimedia.org/r/477456 (https://phabricator.wikimedia.org/T86338) (owner: Marostegui)
[06:12:19] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1110 T86338 T202167 (duration: 00m 53s)
[06:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:24] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338
[06:12:24] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167
[06:12:26] !log Deploy schema change on db1110 T86338 T202167
[06:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:48] (CR) jenkins-bot: db-eqiad.php: Depool db1110 [mediawiki-config] - https://gerrit.wikimedia.org/r/477456 (https://phabricator.wikimedia.org/T86338) (owner: Marostegui)
[06:21:26] Operations, DBA, Patch-For-Review, codfw-rollout: [RFC] improve parsercache replication and sharding handling - https://phabricator.wikimedia.org/T133523 (Marostegui)
[06:23:21] Operations, Scoring-platform-team (Current), User-Ladsgroup, Wikimedia-Incident: Celery manager implodes horribly if Redis goes down - https://phabricator.wikimedia.org/T181632 (Joe)
[06:23:26] Operations, Scoring-platform-team, Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538 (Joe)
[06:23:32] Operations, Scoring-platform-team (Current), User-Ladsgroup, Wikimedia-Incident: Investigate redis-cluster or other techniques for making Redis not a single point of failure. - https://phabricator.wikimedia.org/T181559 (Joe) Resolved>Open
[06:23:50] Operations, ops-codfw, DBA, decommission: Decommission parsercache hosts: pc2004 pc2005 pc2006 (Dec 2018 lease return) - https://phabricator.wikimedia.org/T209858 (Marostegui)
[06:24:49] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1110" [mediawiki-config] - https://gerrit.wikimedia.org/r/477457
[06:26:13] Operations, Scoring-platform-team (Current), User-Ladsgroup, Wikimedia-Incident: Investigate redis-cluster or other techniques for making Redis not a single point of failure. - https://phabricator.wikimedia.org/T181559 (Joe) Hi @Ladsgroup can you please elaborate on why you decided to go with sen...
[06:28:28] PROBLEM - netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 547 bytes in 0.009 second response time
[06:28:50] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:31:04] Operations, ORES, vm-requests, Scoring-platform-team (Current): New node request: oresrdb[12]003 - https://phabricator.wikimedia.org/T210582 (Joe) I think we should pause this request until the choices that generated this ticket have been properly discussed with the SRE team.
[06:31:37] <_joe_> Amir1: I'm very disappointed with how you all have managed this "Redis HA" discussion
[06:31:50] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/bash_autologout.sh]
[06:32:17] <_joe_> you can't expect SRE to support whatever backend storage new technology you picked without even the slightest justification on phabricator, or even a proper ping about it during the decision process
[06:33:38] <_joe_> that said, let's try to work together, probably redis sentinel is really the best solution in a void, but it's a new technology for data storage to introduce in production, it's not really reasonable to go about it this way
[06:35:04] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1110" [mediawiki-config] - https://gerrit.wikimedia.org/r/477457 (owner: Marostegui)
[06:36:07] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1110" [mediawiki-config] - https://gerrit.wikimedia.org/r/477457 (owner: Marostegui)
[06:37:06] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1110 T86338 T202167 (duration: 00m 49s)
[06:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:11] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338
[06:37:12] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167
[06:37:14] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational
[06:38:04] RECOVERY - netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 348 bytes in 0.563 second response time
[06:39:18] !log Deploy schema change on s5 primary master (db1070) T86338 T202167
[06:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:39] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1110" [mediawiki-config] - https://gerrit.wikimedia.org/r/477457 (owner: Marostegui)
[06:53:56] (PS3) Legoktm: extdist: Switch to Python 3 [puppet] - https://gerrit.wikimedia.org/r/475579 (https://phabricator.wikimedia.org/T210312)
[06:53:58] (CR) Legoktm: extdist: Switch to Python 3 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/475579 (https://phabricator.wikimedia.org/T210312) (owner: Legoktm)
[06:57:40] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:10:14] Operations, Puppet, ORES, Scoring-platform-team, Wikimedia-Incident: Logrotate should restart services when more people are around - https://phabricator.wikimedia.org/T210720 (akosiaris) Open>Resolved a:akosiaris I 'll do so, thanks
[07:12:29] (PS1) Giuseppe Lavagetto: profile::mediawiki::php: add auto_prepend_file to fcgi [puppet] - https://gerrit.wikimedia.org/r/477463
[07:12:31] (PS1) Giuseppe Lavagetto: mediawiki::web::vhost: remove serve_php7 feature flag [puppet] - https://gerrit.wikimedia.org/r/477464
[07:12:33] (PS1) Giuseppe Lavagetto: mediawiki: allow proxying to php-fpm via a unix socket [puppet] - https://gerrit.wikimedia.org/r/477465
[07:14:08] (CR) jerkins-bot: [V: -1] mediawiki: allow proxying to php-fpm via a unix socket [puppet] - https://gerrit.wikimedia.org/r/477465 (owner: Giuseppe Lavagetto)
[07:19:11] (PS2) Giuseppe Lavagetto: mediawiki: allow proxying to php-fpm via a unix socket [puppet] - https://gerrit.wikimedia.org/r/477465
[07:20:21] (CR) jerkins-bot: [V: -1] mediawiki: allow proxying to php-fpm via a unix socket [puppet] - https://gerrit.wikimedia.org/r/477465 (owner: Giuseppe Lavagetto)
[07:22:55] !log Deploy schema change on wikitech primary master (db1073) for labswiki and labtestwiki T86338 T202167
[07:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:59] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338
[07:23:00] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167
[07:26:09] !log Deploy schema change on s4 codfw master (db2051) with replication T86338 T202167
[07:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:00] (CR) Giuseppe Lavagetto: [C: 2] profile::mediawiki::php: add auto_prepend_file to fcgi [puppet] - https://gerrit.wikimedia.org/r/477463 (owner: Giuseppe Lavagetto)
[07:29:52] Operations, ops-codfw, netops: codfw row D recable and add QFX - https://phabricator.wikimedia.org/T210467 (elukey) Need to check with Joe but I'd do the following: * replace mw2287 in mcrouter codfw proxy config with another one in DX (with X!=4) * before the maintenance, remove mc2033 from the mcr...
[07:35:30] (PS1) Elukey: mcrouter: replace codfw proxy before maintenance [puppet] - https://gerrit.wikimedia.org/r/477472 (https://phabricator.wikimedia.org/T210467)
[07:38:32] (PS1) Elukey: mcrouter: temporary remove mc2033 to ease network maintenance [puppet] - https://gerrit.wikimedia.org/r/477473 (https://phabricator.wikimedia.org/T210467)
[07:41:01] RECOVERY - Restbase root url on restbase2013 is OK: HTTP OK: HTTP/1.1 200 - 16164 bytes in 0.118 second response time
[07:41:58] Operations, Core Platform Team Backlog (Next), Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (elukey) As mentioned in https://phabricator.wikimedia.org/T209711#4788954 I am looping in @hashar to also allow Releng to test NodeJS 10 :)
[07:42:12] (PS2) Elukey: mcrouter: temporary remove mc2033 to ease network maintenance [puppet] - https://gerrit.wikimedia.org/r/477473 (https://phabricator.wikimedia.org/T210467)
[07:45:43] (PS1) Muehlenhoff: Remove access for arnad [puppet] - https://gerrit.wikimedia.org/r/477474
[07:47:44] (CR) Muehlenhoff: [C: 2] Remove access for arnad [puppet] - https://gerrit.wikimedia.org/r/477474 (owner: Muehlenhoff)
[07:48:17] Operations, ops-codfw, Patch-For-Review, Services (watching), User-fgiunchedi: rack/setup/install restbase201[3-8].codfw.wmnet - https://phabricator.wikimedia.org/T209615 (fgiunchedi) >>! In T209615#4796095, @RobH wrote: > @fgiunchedi: Can you advise if these are fully online, and if so, can...
[07:49:21] !log bootstrap cassandra-c on restbase2014 - T209615
[07:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:25] T209615: rack/setup/install restbase201[3-8].codfw.wmnet - https://phabricator.wikimedia.org/T209615
[07:50:36] (CR) Giuseppe Lavagetto: [C: 2] mediawiki::web::vhost: remove serve_php7 feature flag [puppet] - https://gerrit.wikimedia.org/r/477464 (owner: Giuseppe Lavagetto)
[07:50:54] (PS2) Giuseppe Lavagetto: mediawiki::web::vhost: remove serve_php7 feature flag [puppet] - https://gerrit.wikimedia.org/r/477464
[07:51:24] Operations, ops-codfw: wtp2020: correctable memory errors - https://phabricator.wikimedia.org/T205712 (fgiunchedi) This is back, any chance for reseating or swapping memory @papaul ?
[07:58:39] (CR) Filippo Giunchedi: "Thanks" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/477366 (https://phabricator.wikimedia.org/T210486) (owner: Cwhite)
[07:58:58] (CR) Filippo Giunchedi: [C: 1] Add graphite cluster definition [puppet] - https://gerrit.wikimedia.org/r/477367 (https://phabricator.wikimedia.org/T210486) (owner: Cwhite)
[08:04:35] Operations, Traffic, netops: IPv6 ~20ms higher ping than IPv4 to gerrit on last ntt hop - https://phabricator.wikimedia.org/T211079 (Reedy)
[08:06:07] Operations, Traffic, netops: IPv6 ~20ms higher ping than IPv4 to gerrit on last ntt hop - https://phabricator.wikimedia.org/T211079 (Reedy)
[08:08:21] Operations, ops-codfw: Degraded RAID on ms-be2021 - https://phabricator.wikimedia.org/T208096 (fgiunchedi) Open>Resolved LGTM on my side too, I've reenabled the event handler.
[08:11:02] !log installing perl security updates on jessie/trusty (stretch already updated)
[08:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:04] (PS1) Elukey: service::node: add the 'use_nodejs10' parameter [puppet] - https://gerrit.wikimedia.org/r/477475 (https://phabricator.wikimedia.org/T210704)
[08:22:18] PROBLEM - puppet last run on stat1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): User[arnad]
[08:32:07] ^ fixed
[08:33:53] _joe_: hey, FWIW, I didn't even suggest sentinel, it was suggested by SRE. I looked at it and it looked like the only option
[08:35:55] !log graphite1004 & graphite2003, /var/lib/carbon/whisper/daily/wikidata/api/wbgetclaims$ sudo -u _graphite find . -type f -name "*.wsp" -delete # T140280
[08:35:57] !log restarting stuck tilerator on maps* - T204047
[08:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:59] T140280: Delete daily.wikidata.api.wbgetclaims.properties.* graphite metrics - https://phabricator.wikimedia.org/T140280
[08:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:02] T204047: investigate tilerator crash on maps eqiad - https://phabricator.wikimedia.org/T204047
[08:37:44] RECOVERY - tilerator on maps1003 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.037 second response time
[08:37:48] RECOVERY - tilerator on maps1001 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.035 second response time
[08:37:52] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[08:37:54] RECOVERY - tilerator on maps1002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.049 second response time
[08:38:08] Operations, SRE-Access-Requests: Requesting access to deployment for Christoph Jauera (WMDE-Fisch) - https://phabricator.wikimedia.org/T211014 (Tobi_WMDE_SW) I'm approving this ticket from my side in my role as the responsible Engineering Manager at WMDE. The topic has been discussed within the teams her...
[08:41:37] !log graphite1004 & graphite2003, /var/lib/carbon/whisper/daily/wikidata/datamodel$ sudo -u _graphite rm wikipedia_references.wsp # T121521
[08:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:41] T121521: Delete daily.wikidata.datamodel.wikipedia_references graphite metric - https://phabricator.wikimedia.org/T121521
[08:42:19] !log Deploy schema change on dbstore1002:s4 T86338 T202167
[08:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:26] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338
[08:42:26] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167
[08:43:45] !log Deploy schema change on db1102:3314 T86338 T202167
[08:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:16] !log graphite1004 & graphite2003, /var/lib/carbon/whisper/daily/wikidata/api/actions$ sudo -u _graphite find . -type f -name "*_*.wsp" -delete # T120639
[08:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:20] T120639: Delete daily.wikidata.api.actions..... metrics that were incorrectly added - https://phabricator.wikimedia.org/T120639
[08:46:20] !log graphite1004 & graphite2003, /var/lib/carbon/whisper/daily/wikidata/api/actions$ sudo -u _graphite find . -type f -name "*-*.wsp" -delete # T120639
[08:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:11] (PS1) Marostegui: db-eqiad.php: Depool db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/477495 (https://phabricator.wikimedia.org/T86338)
[08:52:20] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/477495 (https://phabricator.wikimedia.org/T86338) (owner: Marostegui)
[08:53:25] (Merged) jenkins-bot: db-eqiad.php: Depool db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/477495 (https://phabricator.wikimedia.org/T86338) (owner: Marostegui)
[08:54:28] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1097:3314 T86338 T202167 (duration: 00m 47s)
[08:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:32] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338
[08:54:33] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167
[08:54:36] !log Deploy schema change on db1097:3314 T86338 T202167
[08:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:49] !log graphite1004 & graphite2003, /var/lib/carbon/whisper/MediaWiki/electronpdf/action # Ran https://phabricator.wikimedia.org/P7882 for T157012
[09:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:52] T157012: Delete most MediaWiki.electronpdf.action.* graphite metrics - https://phabricator.wikimedia.org/T157012
[09:02:02] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[09:04:11] mmmm
[09:04:26] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[09:05:17] (CR) jenkins-bot: db-eqiad.php: Depool db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/477495 (https://phabricator.wikimedia.org/T86338) (owner: Marostegui)
[09:06:10] addshore: re: graphite, thanks! btw now you shouldn't be able to access anymore the non-active graphite hosts slated for decom
[09:06:26] so that should make it more clear what's active and what's not
[09:06:41] a lot of tkos from mcrouter
[09:06:42] https://grafana.wikimedia.org/dashboard/db/mcrouter?panelId=6&fullscreen&orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-instance=All
[09:06:50] godog: that sounds perfect :)
[09:08:09] spot checking on mcrouter logs it seems mc1022.eqiad.wmnet
[09:08:36] yeah https://grafana.wikimedia.org/dashboard/db/memcache?panelId=44&fullscreen&orgId=1
[09:08:53] addshore, do you remember which logo was wrong (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/477255) yesterday? It's not mentioned in the logs and the screenshot linked in logs is gone now...
[09:11:11] Urbanecm: i don't know, but i'm sure lucas can tell us when he arrives :)
[09:11:12] !log add elastic2038 to cirrus eqiad (new server) - T210265
[09:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:16] T210265: Setup elasticsearch on new codfw servers - https://phabricator.wikimedia.org/T210265
[09:11:16] memcached - it seems again the problem of T203786
[09:11:17] T203786: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786
[09:12:14] * Urbanecm intended to ping both Lucas and Zoranzoki, but only addshore was here
[09:12:25] thanks, will ask him later
[09:13:10] Operations, Gadgets, MediaWiki-Cache, MW-1.33-notes (1.33.0-wmf.6; 2018-11-27), and 4 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (elukey) Between 8:10 and 9 UTC this morning there were enough TKOs t...
[09:13:44] updated task --^
[09:14:37] Urbanecm: :) i'll try to poke him when he walks past me too!
[09:17:31] (PS2) Gehel: elasticsearch: create base data dir [puppet] - https://gerrit.wikimedia.org/r/477314
[09:19:16] (CR) Gehel: [C: 2] elasticsearch: create base data dir [puppet] - https://gerrit.wikimedia.org/r/477314 (owner: Gehel)
[09:19:37] (CR) Muehlenhoff: "One nit, but looks fine." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/477475 (https://phabricator.wikimedia.org/T210704) (owner: Elukey)
[09:27:16] (PS2) Elukey: service::node: add the 'use_nodejs10' parameter [puppet] - https://gerrit.wikimedia.org/r/477475 (https://phabricator.wikimedia.org/T210704)
[09:29:11] (PS1) Mobrovac: Citoid: Switch to using Zotero v2 [puppet] - https://gerrit.wikimedia.org/r/477498 (https://phabricator.wikimedia.org/T197242)
[09:31:25] (PS1) Elukey: turnilo: add nodejs 10 stretch apt component [puppet] - https://gerrit.wikimedia.org/r/477499 (https://phabricator.wikimedia.org/T210705)
[09:31:33] !log add elastic2039-2044 to cirrus eqiad (new server) - T210265
[09:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:36] T210265: Setup elasticsearch on new codfw servers - https://phabricator.wikimedia.org/T210265
[09:32:06] (CR) Elukey: [C: 2] turnilo: add nodejs 10 stretch apt component [puppet] - https://gerrit.wikimedia.org/r/477499 (https://phabricator.wikimedia.org/T210705) (owner: Elukey)
[09:33:59] (CR) Elukey: [C: -1] "Dependency cycle! Removing before constraint.." [puppet] - https://gerrit.wikimedia.org/r/477475 (https://phabricator.wikimedia.org/T210704) (owner: Elukey)
[09:36:34] <_joe_> Amir1: that's not what I gathered from the tasks. akosiaris merely suggested we could look into sentinel
[09:37:34] (PS1) Elukey: turnilo: fix dependency cycle removing require_package [puppet] - https://gerrit.wikimedia.org/r/477500 (https://phabricator.wikimedia.org/T210704)
[09:37:37] <_joe_> anyways, my point is
[09:38:12] (CR) Muehlenhoff: "Ah, yes. You can't use require_package together with a package relation shit, so the require_package needs to be converted to package->pre" [puppet] - https://gerrit.wikimedia.org/r/477475 (https://phabricator.wikimedia.org/T210704) (owner: Elukey)
[09:39:03] (CR) Elukey: [C: 2] turnilo: fix dependency cycle removing require_package [puppet] - https://gerrit.wikimedia.org/r/477500 (https://phabricator.wikimedia.org/T210704) (owner: Elukey)
[09:39:04] <_joe_> if you want SRE to provide oncall rotation for a new data storage solution, you need to allow us the time to assess it
[09:40:36] <_joe_> and also I'd like to see a reasoning as to why you prefer sentinel over redis-cluster
[09:40:56] _joe_: I had the impression that it was assessed, sorry for the misunderstanding
[09:41:04] <_joe_> yeah np :)
[09:41:11] I'm writing my reasoning in depth in the ticket
[09:41:21] <_joe_> Amir1: the reason why I was alarmed is I've had bad experiences with redis sentinel in the past
[09:41:27] <_joe_> but it was very very *new*
[09:41:37] <_joe_> also dynomite wouldn't work, I checked in the meantime
[09:41:59] Redis transaction, I looked at it before going to sentinel
[09:43:46] <_joe_> it's a pity, I trust dynomite more than sentinel overall
[09:45:10] <_joe_> Amir1: just to be clear, I'm convinced you probably did your homework, but we need to do ours as well. For example, how easy is it to monitor sentinel? is there an existing prometheus exporter that supports it properly? what are the failure scenarios? and the recovery?
[09:45:45] <_joe_> also: isn't there a task queueing system a bit less horrible than celery that we could use? :P
[09:45:47] I totally understand. I want to help you assess it
[09:45:52] <_joe_> <3
[09:46:23] * akosiaris around, in the middle of a migration
[09:46:33] _joe_: I improved celery so much in the past quarter (reducing response time to 1/4)
[09:46:38] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=elasticsearch
[09:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:41] !log disable puppet on scb for citoid migration to zoterov2 T197242
[09:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:46] T197242: Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242
[09:46:48] by upgrading to celery4 and using a redis task tracker (deduplicator)
[09:47:29] <_joe_> Amir1: yeah, I'm just wondering if we wouldn't be in general better off using changeprop calling an uwsgi endpoint
[09:47:52] (PS2) Alexandros Kosiaris: Citoid: Switch to using Zotero v2 [puppet] - https://gerrit.wikimedia.org/r/477498 (https://phabricator.wikimedia.org/T197242) (owner: Mobrovac)
[09:48:00] (CR) Alexandros Kosiaris: [C: 2] Citoid: Switch to using Zotero v2 [puppet] - https://gerrit.wikimedia.org/r/477498 (https://phabricator.wikimedia.org/T197242) (owner: Mobrovac)
[09:48:16] (PS3) Elukey: service::node: add the 'use_nodejs10' parameter [puppet] - https://gerrit.wikimedia.org/r/477475 (https://phabricator.wikimedia.org/T210704)
[09:48:19] (CR) Alexandros Kosiaris: [V: 2 C: 2] Citoid: Switch to using Zotero v2 [puppet] - https://gerrit.wikimedia.org/r/477498 (https://phabricator.wikimedia.org/T197242) (owner: Mobrovac)
[09:48:44] <_joe_> akosiaris: can I set sca* on fire and pour salt on the racks?
[09:49:03] _joe_: nope, that's my privilege [09:49:11] request denied [09:49:15] you had your fun with ocg [09:49:32] <_joe_> yeah fair enough [09:49:41] <_joe_> zotero is your rosebud [09:49:49] if only [09:50:22] !log enable puppet on scb2001, run puppet T197242 [09:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:41] !log mobrovac@deploy1001 Started deploy [citoid/deploy@b902865]: Switch Citoid to Zotero v2 on scb2001 - T197242 [09:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:11] !log mobrovac@deploy1001 Finished deploy [citoid/deploy@b902865]: Switch Citoid to Zotero v2 on scb2001 - T197242 (duration: 00m 30s) [09:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:14] T197242: Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 [09:52:30] !log upgrading nginx on elasticsearch codfw [09:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:50] !log enable puppet on all scb2*, run puppet T197242 [09:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:10] 10Operations, 10ops-codfw: rack/setup/install codfw logstash elasticsearch storage servers - https://phabricator.wikimedia.org/T211065 (10fgiunchedi) [09:57:20] 10Operations, 10ops-codfw: rack/setup/install codfw logstash elasticsearch storage servers - https://phabricator.wikimedia.org/T211065 (10fgiunchedi) @papaul names replaced! thanks [09:59:23] !log upgrading nginx on elasticsearch eqiad [09:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:19] 10Operations, 10Performance-Team, 10Wikidata, 10Wikidata-Query-Service: Errors trying to fetch RDF from Wikidata - https://phabricator.wikimedia.org/T207718 (10ema) >>! 
In T207718#4795152, @Imarlier wrote: > - If it possible for nginx to be restarted (interrupting existing persistent connections) due to co... [10:01:04] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [10:01:18] !log mobrovac@deploy1001 Started deploy [citoid/deploy@b902865]: Switch Citoid to Zotero v2 in codfw - T197242 [10:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:21] T197242: Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 [10:02:18] Operations, ops-codfw, Core Platform Team, Services (doing), and 2 others: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843 (fgiunchedi) [10:02:33] !log bootstrap cassandra-a on restbase2015 - T210843 [10:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:37] T210843: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843 [10:03:03] !log mobrovac@deploy1001 Finished deploy [citoid/deploy@b902865]: Switch Citoid to Zotero v2 in codfw - T197242 (duration: 01m 45s) [10:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:41] Operations, Performance-Team, Traffic, Wikidata, Wikidata-Query-Service: Errors trying to fetch RDF from Wikidata - https://phabricator.wikimedia.org/T207718 (ema) [10:10:02] Operations, Citoid, Services (watching), VisualEditor (Current work): Decreased internationalisation of automatic citations as a result of switch to new translation-server - https://phabricator.wikimedia.org/T210806 (danstillman) We've implemented `Accept-Language` forwarding upstream. 
[10:18:50] (PS1) Banyek: mariadb: materialized view generator for analytics team [puppet] - https://gerrit.wikimedia.org/r/477503 (https://phabricator.wikimedia.org/T210693) [10:19:23] (CR) jerkins-bot: [V: -1] mariadb: materialized view generator for analytics team [puppet] - https://gerrit.wikimedia.org/r/477503 (https://phabricator.wikimedia.org/T210693) (owner: Banyek) [10:20:09] Operations, Maps, Discovery-Search (Current work), Patch-For-Review: Cron test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ) - https://phabricator.wikimedia.org/T210940 (Mathew.onipe) Open>Resolved I tested this on maps-production. Everything seem... [10:22:11] (PS2) Banyek: mariadb: materialized view generator for analytics team [puppet] - https://gerrit.wikimedia.org/r/477503 (https://phabricator.wikimedia.org/T210693) [10:22:43] (CR) jerkins-bot: [V: -1] mariadb: materialized view generator for analytics team [puppet] - https://gerrit.wikimedia.org/r/477503 (https://phabricator.wikimedia.org/T210693) (owner: Banyek) [10:24:43] (PS3) Banyek: mariadb: materialized view generator for analytics team [puppet] - https://gerrit.wikimedia.org/r/477503 (https://phabricator.wikimedia.org/T210693) [10:25:36] (CR) jerkins-bot: [V: -1] mariadb: materialized view generator for analytics team [puppet] - https://gerrit.wikimedia.org/r/477503 (https://phabricator.wikimedia.org/T210693) (owner: Banyek) [10:28:56] Urbanecm: Lucas_WMDE is here now :D [10:29:04] addshore, wm-bot just pinged me :D [10:29:19] hehe [10:29:28] Lucas_WMDE, do you remember which logo was bad yesterday (https://gerrit.wikimedia.org/r/c/477139/)? [10:29:54] PROBLEM - puppet last run on scb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:30:02] PROBLEM - puppet last run on scb1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:30:02] * akosiaris fixing this ^ [10:30:04] Urbanecm: I think I took the screenshot at 150% zoom [10:30:12] but if I recall correctly, at 200% it still looked broken [10:30:17] (the wiki was itwikisource) [10:30:31] Thanks, I'll investigate this. [10:30:37] I also left a comment at https://phabricator.wikimedia.org/T150618#4794805 [10:30:38] PROBLEM - puppet last run on scb1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:30:42] Did you try other touched wikis too? [10:30:49] no, unfortunately [10:30:51] ah it doesn't really need fixing right now, just ignore [10:31:00] PROBLEM - cassandra-b CQL 10.192.32.111:9042 on restbase2016 is CRITICAL: connect to address 10.192.32.111 and port 9042: Connection refused [10:31:00] PROBLEM - cassandra-c CQL 10.192.32.175:9042 on restbase2016 is CRITICAL: connect to address 10.192.32.175 and port 9042: Connection refused [10:31:06] Ok, thank you Lucas_WMDE [10:31:14] PROBLEM - cassandra-a service on restbase2016 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [10:31:44] PROBLEM - cassandra-a SSL 10.192.32.108:7001 on restbase2016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [10:31:52] PROBLEM - cassandra-b SSL 10.192.32.111:7001 on restbase2016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [10:33:42] PROBLEM - puppet last run on scb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:34:37] <_joe_> godog: I assume it's you installing it? 
[10:34:48] <_joe_> restbase2016 I mean [10:35:21] ugh, yeah expired downtime, apologies [10:35:53] <_joe_> no problems :) [10:36:20] PROBLEM - cassandra-b service on restbase2016 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [10:41:51] !log rebooting analytics-tool1001 to pick up SSBD-enabled qemu [10:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:01] !log mobrovac@deploy1001 Started deploy [restbase/deploy@8abcbda]: Disable Citoid test for switching it to Zotero v2 - T211088 T197242 [10:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:05] T197242: Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 [10:43:06] T211088: Wikipedia pages parsed as website, not encyclopedia - https://phabricator.wikimedia.org/T211088 [10:43:30] Operations: puppet (systemd::service) attempts to start masked units - https://phabricator.wikimedia.org/T211027 (fgiunchedi) Looks like this is working as intended for `systemd` provider (`/usr/lib/ruby/vendor_ruby/puppet/provider/service/systemd.rb`) ` def enable self.unmask systemctl_change_ena... 
[10:46:11] Operations: puppet (systemd::service) attempts to start masked units - https://phabricator.wikimedia.org/T211027 (fgiunchedi) [10:47:59] !log rebooting analytics-tool1002 to pick up SSBD-enabled qemu [10:48:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:21] !log rebooting analytics-tool1003 to pick up SSBD-enabled qemu [10:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:12] (CR) Gehel: [C: -1] "see comments inline" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/477429 (https://phabricator.wikimedia.org/T210044) (owner: Smalyshev) [10:57:04] !log deploying AQS to expose offset and underestimate numbers on unique devices [10:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:42] !log fdans@deploy1001 Started deploy [analytics/aqs/deploy@e9a63cc]: Deploying offset and underestimate numbers for uniques [10:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:20] !log fdans@deploy1001 Finished deploy [analytics/aqs/deploy@e9a63cc]: Deploying offset and underestimate numbers for uniques (duration: 00m 37s) [10:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:06] !log elukey@deploy1001 Started deploy [analytics/aqs/deploy@e9a63cc]: Expose offset and underestimate numbers on unique devices - T164201 [11:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:09] T164201: AQS unique devices api should report offset/underestimate separately - https://phabricator.wikimedia.org/T164201 [11:03:59] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@8abcbda]: Disable Citoid test for switching it to Zotero v2 - T211088 T197242 (duration: 20m 59s) [11:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:03] T197242: Transition citoid to use Zotero's translation-server-v2 - 
https://phabricator.wikimedia.org/T197242 [11:04:04] T211088: Wikipedia pages parsed as website, not encyclopedia - https://phabricator.wikimedia.org/T211088 [11:08:36] (CR) Marostegui: "I have added some comments to keep in mind when starting the most important part of the patch, the script that will generate the materiali" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/477503 (https://phabricator.wikimedia.org/T210693) (owner: Banyek) [11:11:14] (CR) Banyek: "Thanks for those!" [puppet] - https://gerrit.wikimedia.org/r/477503 (https://phabricator.wikimedia.org/T210693) (owner: Banyek) [11:11:47] (PS1) Arturo Borrero Gonzalez: toolforge: homogenize system::role calls [puppet] - https://gerrit.wikimedia.org/r/477508 [11:12:10] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/477509 [11:12:12] !log elukey@deploy1001 Finished deploy [analytics/aqs/deploy@e9a63cc]: Expose offset and underestimate numbers on unique devices - T164201 (duration: 09m 06s) [11:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:16] T164201: AQS unique devices api should report offset/underestimate separately - https://phabricator.wikimedia.org/T164201 [11:12:42] (CR) Arturo Borrero Gonzalez: [C: 2] toolforge: homogenize system::role calls [puppet] - https://gerrit.wikimedia.org/r/477508 (owner: Arturo Borrero Gonzalez) [11:17:04] !log enable puppet on scb1001, run puppet T197242 [11:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:07] T197242: Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 [11:18:26] !log mobrovac@deploy1001 Started deploy [citoid/deploy@b902865]: Switch Citoid to Zotero v2 on scb1001 - T197242 [11:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:56] !log mobrovac@deploy1001 Finished deploy 
[citoid/deploy@b902865]: Switch Citoid to Zotero v2 on scb1001 - T197242 (duration: 00m 30s) [11:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:24] RECOVERY - puppet last run on scb1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:25:21] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/477509 (owner: Marostegui) [11:26:27] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/477509 (owner: Marostegui) [11:27:25] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1097:3314 T86338 T202167 (duration: 00m 46s) [11:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:30] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [11:27:30] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 [11:28:29] (PS3) Giuseppe Lavagetto: mediawiki: allow proxying to php-fpm via a unix socket [puppet] - https://gerrit.wikimedia.org/r/477465 [11:28:53] Operations, cloud-services-team, monitoring, User-fgiunchedi: Port DirectorySize diamond collector to a Prometheus exporter - https://phabricator.wikimedia.org/T211094 (MoritzMuehlenhoff) p:Triage>Normal [11:29:41] (CR) jerkins-bot: [V: -1] mediawiki: allow proxying to php-fpm via a unix socket [puppet] - https://gerrit.wikimedia.org/r/477465 (owner: Giuseppe Lavagetto) [11:30:47] (PS1) Marostegui: db-eqiad.php: Depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/477510 (https://phabricator.wikimedia.org/T86338) [11:31:52] !log enable puppet on scb1002, run puppet T197242 [11:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:55] T197242: Transition citoid to use Zotero's 
translation-server-v2 - https://phabricator.wikimedia.org/T197242 [11:32:04] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/477510 (https://phabricator.wikimedia.org/T86338) (owner: Marostegui) [11:33:08] (Merged) jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/477510 (https://phabricator.wikimedia.org/T86338) (owner: Marostegui) [11:33:14] !log mobrovac@deploy1001 Started deploy [citoid/deploy@b902865]: Switch Citoid to Zotero v2 on scb1002 - T197242 [11:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:35] Operations, Analytics, SRE-Access-Requests: Grant fdans permissions to deploy AQS in prod, and accessing the aqs hosts - https://phabricator.wikimedia.org/T211095 (fdans) [11:33:42] !log mobrovac@deploy1001 Finished deploy [citoid/deploy@b902865]: Switch Citoid to Zotero v2 on scb1002 - T197242 (duration: 00m 28s) [11:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:11] !log enable puppet on scb1003, run puppet T197242 [11:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:24] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1084 T86338 T202167 (duration: 00m 47s) [11:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:29] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [11:34:29] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 [11:35:28] RECOVERY - puppet last run on scb1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:35:30] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/477509 (owner: Marostegui) [11:35:32] (CR) jenkins-bot: db-eqiad.php: Depool db1084 
[mediawiki-config] - https://gerrit.wikimedia.org/r/477510 (https://phabricator.wikimedia.org/T86338) (owner: Marostegui) [11:35:41] !log mobrovac@deploy1001 Started deploy [citoid/deploy@b902865]: Switch Citoid to Zotero v2 on scb1003 - T197242 [11:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:50] !log Deploy schema change on db1084 T86338 T202167 [11:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:01] !log mobrovac@deploy1001 Finished deploy [citoid/deploy@b902865]: Switch Citoid to Zotero v2 on scb1003 - T197242 (duration: 00m 20s) [11:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:44] !log enable puppet on scb1004, run puppet T197242 [11:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:36] RECOVERY - puppet last run on scb1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:37:40] !log rebooting puppetboard1001 to pick up SSBD-enabled qemu [11:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:20] !log mobrovac@deploy1001 Started deploy [citoid/deploy@b902865]: Switch Citoid to Zotero v2 on scb1004 - T197242 [11:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:23] T197242: Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 [11:38:41] !log mobrovac@deploy1001 Finished deploy [citoid/deploy@b902865]: Switch Citoid to Zotero v2 on scb1004 - T197242 (duration: 00m 21s) [11:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:03] (PS1) GTirloni: PAWS: Pin Kubernetes packages to version 1.13.0 [puppet] - https://gerrit.wikimedia.org/r/477514 (https://phabricator.wikimedia.org/T211096) [11:41:03] !log rebooting puppetboard2001 to pick up SSBD-enabled qemu [11:41:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:58] RECOVERY - puppet last run on scb1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:43:01] Operations, Citoid, Services (done), VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (mobrovac) Open>Resolved Citoid in production has been switched to use Zotero v2. [11:44:31] !log mobrovac@deploy1001 Started deploy [restbase/deploy@8abcbda] (dev-cluster): (no justification provided) [11:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:02] (PS2) Revi: Add SPF record for wikidata.org [dns] - https://gerrit.wikimedia.org/r/477034 (https://phabricator.wikimedia.org/T210134) [11:46:29] (PS2) GTirloni: PAWS: Pin Kubernetes and Docker-CE packages [puppet] - https://gerrit.wikimedia.org/r/477514 (https://phabricator.wikimedia.org/T211096) [11:48:01] (PS4) Giuseppe Lavagetto: mediawiki: allow proxying to php-fpm via a unix socket [puppet] - https://gerrit.wikimedia.org/r/477465 [11:49:19] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@8abcbda] (dev-cluster): (no justification provided) (duration: 04m 47s) [11:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:46] (CR) Giuseppe Lavagetto: [C: 2] "https://puppet-compiler.wmflabs.org/compiler1002/13827/" [puppet] - https://gerrit.wikimedia.org/r/477465 (owner: Giuseppe Lavagetto) [12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181204T1200). [12:00:04] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:17] * Urbanecm waves [12:12:25] (CR) Mobrovac: service::node: add the 'use_nodejs10' parameter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/477475 (https://phabricator.wikimedia.org/T210704) (owner: Elukey) [12:16:08] anybody to do SWAT? [12:16:15] addshore, ? [12:16:20] zeljkof, ? [12:16:47] (CR) Muehlenhoff: [C: 1] service::node: add the 'use_nodejs10' parameter [puppet] - https://gerrit.wikimedia.org/r/477475 (https://phabricator.wikimedia.org/T210704) (owner: Elukey) [12:17:13] !log installing tiff security updates [12:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:57] (PS1) Lucas Werkmeister (WMDE): Specify $wgWBRepoSettings['conceptBaseUri'] [mediawiki-config] - https://gerrit.wikimedia.org/r/477521 [12:20:59] (PS1) Lucas Werkmeister (WMDE): Fix Wikidata base URI in client config [mediawiki-config] - https://gerrit.wikimedia.org/r/477522 (https://phabricator.wikimedia.org/T198946) [12:21:55] (CR) jerkins-bot: [V: -1] Specify $wgWBRepoSettings['conceptBaseUri'] [mediawiki-config] - https://gerrit.wikimedia.org/r/477521 (owner: Lucas Werkmeister (WMDE)) [12:22:03] (CR) jerkins-bot: [V: -1] Fix Wikidata base URI in client config [mediawiki-config] - https://gerrit.wikimedia.org/r/477522 (https://phabricator.wikimedia.org/T198946) (owner: Lucas Werkmeister (WMDE)) [12:22:11] Lucas_WMDE, could you run a SWAT again? No SWAT conductor appeared within 20 minutes :( [12:23:04] Urbanecm, isn't this a recurring occurrence? may want to chat with greg about it later [12:23:12] recurring occurrence? [12:23:16] Urbanecm: I can start, but eventually we’ll go for lunch (addshore is in the WMDE office as well today) [12:23:17] what am I doing to the english language today [12:23:17] EU SWAT is normally fine [12:23:23] isn't this a recurring thing? 
[12:23:45] (PS1) Mathew.onipe: elasticsearch: add new elastic2045-elastic2054 [puppet] - https://gerrit.wikimedia.org/r/477523 (https://phabricator.wikimedia.org/T210265) [12:23:56] EU SWAT is normally fine (honestly, that's the reason why I use only EU SWAT even normally available during the Morning one too) [12:24:05] (PS3) Lucas Werkmeister (WMDE): Revert "Milestone logo for atjwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/449445 (https://phabricator.wikimedia.org/T200713) (owner: Urbanecm) [12:24:23] (CR) Lucas Werkmeister (WMDE): [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/449445 (https://phabricator.wikimedia.org/T200713) (owner: Urbanecm) [12:24:27] Lucas_WMDE, okay, thanks. Deploy as many patches as you can, will reschedule the rest [12:25:33] (Merged) jenkins-bot: Revert "Milestone logo for atjwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/449445 (https://phabricator.wikimedia.org/T200713) (owner: Urbanecm) [12:25:56] Krenair, BTW, _who_ may want to chat with greg about it later? Me or you? 
[12:26:02] Urbanecm: first patch should be on mwdebug1002 now [12:26:07] looking [12:26:22] Urbanecm, you [12:26:27] and maybe others [12:26:33] I don't use it that often anymore [12:26:50] Lucas_WMDE, works, please deploy [12:26:59] Urbanecm, Krenair: fwiw, I pinged him yesterday that I could perhaps join the SWAT team, in #wikimedia-releng [12:27:01] https://wm-bot.wmflabs.org/browser/index.php?start=12%2F03%2F2018&end=12%2F03%2F2018&display=%23wikimedia-releng [12:27:03] ok [12:28:02] (CR) jenkins-bot: Revert "Milestone logo for atjwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/449445 (https://phabricator.wikimedia.org/T200713) (owner: Urbanecm) [12:28:05] thanks for the info Lucas_WMDE [12:28:18] Operations, Traffic, netops: IPv6 ~20ms higher ping than IPv4 to gerrit on last ntt hop - https://phabricator.wikimedia.org/T211079 (BBlack) From bast1001 to the endpoints shown in line (2) above over v4 and v6: ` bblack@bast1002:~$ mtr -c 10 -r -4 bottomless.aa.net.uk Start: Tue Dec 4 12:23:35 2018... [12:28:35] !log lucaswerkmeister-wmde@deploy1001 Synchronized static/images/project-logos/: SWAT: [[gerrit:449445|Revert "Milestone logo for atjwiki" (T200713)]] (duration: 00m 47s) [12:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:38] T200713: Change logo on atj.wp for 4 months - https://phabricator.wikimedia.org/T200713 [12:28:46] Urbanecm: is it possible to test a new namespace change on mwdebug? 
[12:28:54] should be [12:28:57] ok [12:29:03] (PS2) Lucas Werkmeister (WMDE): Create List namespace on euwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/477175 (https://phabricator.wikimedia.org/T209834) (owner: Urbanecm) [12:29:20] lets hope the non-debug servers don’t get too confused if a new page in an unexpected namespace appears [12:29:33] (CR) Lucas Werkmeister (WMDE): [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/477175 (https://phabricator.wikimedia.org/T209834) (owner: Urbanecm) [12:29:36] Operations, Traffic, netops: IPv6 ~20ms higher ping than IPv4 to gerrit on last ntt hop - https://phabricator.wikimedia.org/T211079 (BBlack) (But note that first hop from Ashburn to Chicago is our routers' choice, so it's possible some of our route engineering is at play here). [12:29:42] (unless the test isn’t going to involve creating a page, I guess) [12:30:11] Lucas_WMDE, I'm not going to create a page, but even if I were, the page just won't be available (if I know MediaWiki enough) [12:30:34] (Merged) jenkins-bot: Create List namespace on euwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/477175 (https://phabricator.wikimedia.org/T209834) (owner: Urbanecm) [12:30:57] Urbanecm: should be on the debug server now [12:31:07] thanks, testing [12:31:44] I see the createAndPromote script is no longer in the SWAT window, is that already done? 
[12:31:52] (PS1) Elukey: admin: add fdans to deploy-aqs [puppet] - https://gerrit.wikimedia.org/r/477524 (https://phabricator.wikimedia.org/T211095) [12:32:13] No, I removed it because you told me yesterday you're not comfortable with running it [12:32:21] Lucas_WMDE, please deploy the change [12:32:26] ah, okay [12:33:24] but the one I wanted to run yesterday is done, there were originally two createAndPromote rows in the calendar [12:33:52] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:477175|Create List namespace on euwiki (T209834)]] (duration: 00m 47s) [12:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:56] T209834: Create "Zerrenda:" namespace in euwiki for lists - https://phabricator.wikimedia.org/T209834 [12:34:21] Krenair, thanks for the clarification, although I'm not sure what I might want to tell Greg. [12:34:40] ok [12:34:48] (PS2) Lucas Werkmeister (WMDE): Create namespace "Work" on bnwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/477174 (https://phabricator.wikimedia.org/T210472) (owner: Urbanecm) [12:34:58] (CR) Lucas Werkmeister (WMDE): [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/477174 (https://phabricator.wikimedia.org/T210472) (owner: Urbanecm) [12:35:58] (Merged) jenkins-bot: Create namespace "Work" on bnwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/477174 (https://phabricator.wikimedia.org/T210472) (owner: Urbanecm) [12:36:27] Urbanecm: should be on mwdebug1002 now [12:36:32] looking [12:37:19] working, please deploy [12:38:25] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:477174|Create namespace "Work" on bnwikisource (T210472)]] (duration: 00m 46s) [12:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:28] T210472: Creation of "Work" namespace in Bengali Wikisource - 
https://phabricator.wikimedia.org/T210472 [12:39:02] !log EU SWAT done [12:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:29] Urbanecm: thank you for deploying with Wikimedia releng, have a nice day :) [12:40:04] I'm afraid I have no other option :D [12:40:18] Lucas_WMDE, can you please run namespaceDupes.php for euwiki and bnwikisource? [12:40:52] sorry, I should have noted it in calendar/chat before [12:41:16] (CR) jenkins-bot: Create List namespace on euwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/477175 (https://phabricator.wikimedia.org/T209834) (owner: Urbanecm) [12:41:17] (CR) jenkins-bot: Create namespace "Work" on bnwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/477174 (https://phabricator.wikimedia.org/T210472) (owner: Urbanecm) [12:42:41] sorry, I’m about to leave, no time :/ [12:43:16] Hmm, at least for euwiki, that kinda breaks it... Will try to get somebody else to run it then [12:43:32] (https://phabricator.wikimedia.org/T209834#4797239) [12:45:49] Urbanecm: i can run in in an hour? :) [12:45:55] that'll be great addshore [12:47:35] Coolio! I'll be sure to ping you when I do! [12:48:09] thanks addshore [12:48:47] (PS1) Alexandros Kosiaris: maintain-kubeusers: Add more allowed resources [puppet] - https://gerrit.wikimedia.org/r/477525 (https://phabricator.wikimedia.org/T211040) [12:54:25] (CR) Arturo Borrero Gonzalez: "Where do these packages live? 
in which repo I mean" [puppet] - https://gerrit.wikimedia.org/r/477514 (https://phabricator.wikimedia.org/T211096) (owner: GTirloni) [12:56:20] (CR) GTirloni: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/477514 (https://phabricator.wikimedia.org/T211096) (owner: GTirloni) [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181204T1300) [13:03:33] (PS1) Giuseppe Lavagetto: prometheus::php_fpm_exporter: run as www-data [puppet] - https://gerrit.wikimedia.org/r/477528 [13:03:56] (PS1) Elukey: druid: create request logs for daemons [puppet] - https://gerrit.wikimedia.org/r/477529 [13:05:43] (CR) Giuseppe Lavagetto: [C: 2] prometheus::php_fpm_exporter: run as www-data [puppet] - https://gerrit.wikimedia.org/r/477528 (owner: Giuseppe Lavagetto) [13:05:47] (CR) Giuseppe Lavagetto: [C: 2] "https://puppet-compiler.wmflabs.org/compiler1002/13828/mw1261.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/477528 (owner: Giuseppe Lavagetto) [13:06:48] Operations, Wikimedia-Logstash, Patch-For-Review, User-fgiunchedi: Setup rsyslog to be able to produce logs to Kafka - https://phabricator.wikimedia.org/T206633 (fgiunchedi) [13:07:03] Operations, Wikimedia-Logstash, Core Platform Team Backlog (Watching / External), Patch-For-Review, and 3 others: Begin the implementation of Q1's Logging Infrastructure design (2018-19 Q2 Goal) - https://phabricator.wikimedia.org/T205849 (fgiunchedi) [13:07:08] Operations, Wikimedia-Logstash, Patch-For-Review, User-fgiunchedi: Setup rsyslog to be able to produce logs to Kafka - https://phabricator.wikimedia.org/T206633 (fgiunchedi) Open>Resolved a:fgiunchedi This is completed! 
[13:08:29] (PS1) Arturo Borrero Gonzalez: toolforge: base: extract variables into hiera keys [puppet] - https://gerrit.wikimedia.org/r/477531 [13:09:32] (CR) Joal: [C: 1] "Typo in commit message, but except from that looks super :)" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/477529 (owner: Elukey) [13:09:35] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - https://gerrit.wikimedia.org/r/477532 [13:10:32] Operations, ops-codfw: rack/setup/install codfw logstash elasticsearch storage servers - https://phabricator.wikimedia.org/T211065 (fgiunchedi) [13:10:35] Operations, Wikimedia-Logstash, User-fgiunchedi, User-herron: Logstash hardware expansion - https://phabricator.wikimedia.org/T203169 (fgiunchedi) [13:10:39] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - https://gerrit.wikimedia.org/r/477532 (owner: Marostegui) [13:11:41] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - https://gerrit.wikimedia.org/r/477532 (owner: Marostegui) [13:12:44] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1084 T86338 T202167 (duration: 00m 46s) [13:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:49] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [13:12:50] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 [13:14:29] (PS1) Marostegui: db-eqiad.php: Depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/477534 (https://phabricator.wikimedia.org/T86338) [13:14:31] Operations, ops-codfw: rack/setup/install codfw logstash elasticsearch storage servers - https://phabricator.wikimedia.org/T211065 (fgiunchedi) Also please rack these systems across different rows, any combination of rows will do. 
The rest of the task LGTM [13:14:50] (CR) Arturo Borrero Gonzalez: "> > Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/477514 (https://phabricator.wikimedia.org/T211096) (owner: GTirloni) [13:15:45] (CR) Filippo Giunchedi: [C: 1] logstash: ship kafka server logs to ELK [puppet] - https://gerrit.wikimedia.org/r/476982 (https://phabricator.wikimedia.org/T63788) (owner: Herron) [13:16:15] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/477534 (https://phabricator.wikimedia.org/T86338) (owner: Marostegui) [13:17:15] (Merged) jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/477534 (https://phabricator.wikimedia.org/T86338) (owner: Marostegui) [13:17:30] (PS2) Arturo Borrero Gonzalez: toolforge: grid: base: extract variables into hiera keys [puppet] - https://gerrit.wikimedia.org/r/477531 (https://phabricator.wikimedia.org/T211055) [13:18:58] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1081 T86338 T202167 (duration: 00m 46s) [13:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:02] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [13:19:03] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 [13:19:04] !log Deploy schema change on db1081 T86338 T202167 [13:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:11] (PS1) CDanis: Add grafana to wikimedia-stretch apt repo. [puppet] - https://gerrit.wikimedia.org/r/477535 (https://phabricator.wikimedia.org/T210416) [13:20:43] Operations, ops-codfw: rack/setup/install codfw logstash elasticsearch storage servers - https://phabricator.wikimedia.org/T211065 (Papaul) @fgiunchedi Please provide partman recipe to use. 
I have 4x4TB disks [13:20:44] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477532 (owner: 10Marostegui) [13:20:46] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477534 (https://phabricator.wikimedia.org/T86338) (owner: 10Marostegui) [13:23:22] (03CR) 10GTirloni: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/477514 (https://phabricator.wikimedia.org/T211096) (owner: 10GTirloni) [13:23:39] (03PS2) 10CDanis: Add grafana to wikimedia-stretch apt repo. [puppet] - 10https://gerrit.wikimedia.org/r/477535 (https://phabricator.wikimedia.org/T210416) [13:26:31] (03PS1) 10Filippo Giunchedi: install_server: update logstash partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/477536 (https://phabricator.wikimedia.org/T211065) [13:27:19] (03CR) 10Filippo Giunchedi: [C: 032] install_server: update logstash partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/477536 (https://phabricator.wikimedia.org/T211065) (owner: 10Filippo Giunchedi) [13:27:23] (03CR) 10Muehlenhoff: [C: 031] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/477535 (https://phabricator.wikimedia.org/T210416) (owner: 10CDanis) [13:29:01] (03CR) 10CDanis: [C: 032] Add grafana to wikimedia-stretch apt repo. [puppet] - 10https://gerrit.wikimedia.org/r/477535 (https://phabricator.wikimedia.org/T210416) (owner: 10CDanis) [13:29:28] !log installing nodejs security updates on proton* [13:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:43] (03PS2) 10Filippo Giunchedi: install_server: update logstash partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/477536 (https://phabricator.wikimedia.org/T211065) [13:31:19] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install codfw logstash elasticsearch storage servers - https://phabricator.wikimedia.org/T211065 (10fgiunchedi) >>! 
In T211065#4797402, @Papaul wrote: > @fgiunchedi Please provide partman recipe to use. I have 4x4TB disks You can use `logstash.cfg`... [13:32:40] !log bootstrap cassandra-b on restbase2015 - T210843 [13:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:44] T210843: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843 [13:33:36] !log installing nodejs security updates on restbase in codfw [13:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:13] 10Operations, 10ops-eqiad: eqiad: Re-connect cage cameras - https://phabricator.wikimedia.org/T207965 (10faidon) Any progress on this? [13:34:32] (03PS2) 10Gehel: DNS: Add mgmt and production DNS for elastic2045 - elastic2054 [dns] - 10https://gerrit.wikimedia.org/r/477436 (https://phabricator.wikimedia.org/T210450) (owner: 10Papaul) [13:34:51] !log T210416: adding grafana 5 to wikimedia-stretch: reprepro --restrict grafana update stretch-wikimedia [13:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:54] T210416: Upgrade grafana to 5.x - https://phabricator.wikimedia.org/T210416 [13:35:15] (03CR) 10Gehel: [C: 032] DNS: Add mgmt and production DNS for elastic2045 - elastic2054 [dns] - 10https://gerrit.wikimedia.org/r/477436 (https://phabricator.wikimedia.org/T210450) (owner: 10Papaul) [13:35:59] papaul: ^ [13:37:44] (03Abandoned) 10GTirloni: PAWS: Pin Kubernetes and Docker-CE packages [puppet] - 10https://gerrit.wikimedia.org/r/477514 (https://phabricator.wikimedia.org/T211096) (owner: 10GTirloni) [13:42:35] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477540 [13:43:11] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (10Papaul) [13:44:20] (03CR) 10Marostegui: [C: 032] Revert 
"db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477540 (owner: 10Marostegui) [13:45:51] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477540 (owner: 10Marostegui) [13:46:48] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1081 T86338 T202167 (duration: 00m 46s) [13:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:52] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [13:46:53] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 [13:46:56] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477540 (owner: 10Marostegui) [13:48:00] (03PS1) 10Marostegui: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477542 (https://phabricator.wikimedia.org/T86338) [13:49:10] jouncebot now [13:49:18] jouncebot: now [13:49:19] For the next 0 hour(s) and 10 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181204T1300) [13:49:19] For the next 0 hour(s) and 10 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181204T1300) [13:49:25] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477542 (https://phabricator.wikimedia.org/T86338) (owner: 10Marostegui) [13:49:40] Urbanecm: want that maintenance script run? 
:) [13:49:46] * Lucas_WMDE is back too [13:49:55] sure, thanks [13:49:56] 10Operations, 10Traffic, 10netops: IPv6 ~20ms higher ping than IPv4 to gerrit - https://phabricator.wikimedia.org/T211079 (10faidon) p:05Triage>03High [13:49:59] * Urbanecm waves to Lucas_WMDE and addshore [13:50:26] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477542 (https://phabricator.wikimedia.org/T86338) (owner: 10Marostegui) [13:51:08] I’ll let addshore do the honours [13:52:12] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1103:3314 T86338 T202167 (duration: 00m 47s) [13:52:15] !log Deploy schema change on db1103:3314 T86338 T202167 [13:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:17] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [13:52:17] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 [13:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:32] (03PS2) 10Lucas Werkmeister (WMDE): Specify $wgWBRepoSettings['conceptBaseUri'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477521 [13:52:34] (03PS2) 10Lucas Werkmeister (WMDE): Fix Wikidata base URI in client config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477522 (https://phabricator.wikimedia.org/T198946) [13:52:47] Urbanecm: i need to add --fix right? ;) [13:52:58] yup [13:53:06] Urbanecm: for euwiki 2131 links to fix and all are fixable [13:53:15] ack, thanks [13:53:17] !log addshore@mwmaint1002:~$ mwscript namespaceDupes.php --wiki=euwiki --fix [13:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:39] done, will check bnwikisource now [13:54:06] bnwikisource 5 pages, 0 resolvable, 3 links, 3 resolvable [13:54:30] Urbanecm: ^^ i guess you want me to run it with --fix and paste the list somewhere?
:P [13:54:57] https://www.irccloud.com/pastebin/6QWBZ2wx/ [13:55:10] thanks addshore [13:55:24] (03PS4) 10Gehel: Enable SPARQL logging to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/477429 (https://phabricator.wikimedia.org/T210044) (owner: 10Smalyshev) [13:55:28] can you run the script with --fix and --add-prefix=T210472 please? [13:55:28] 10Operations, 10Traffic, 10netops: IPv6 ~20ms higher ping than IPv4 to gerrit - https://phabricator.wikimedia.org/T211079 (10faidon) The forward paths are nearly identical, but the reverse is not: reverse path selection is HE for IPv6 and NTT for IPv4, so different paths, and latency could be reasonably expl... [13:55:28] T210472: Creation of "Work" namespace in Bengali Wikisource - https://phabricator.wikimedia.org/T210472 [13:55:34] (for bnwikisource) [13:55:40] yes [13:55:50] thanks [13:56:23] !log addshore@mwmaint1002:~$ mwscript namespaceDupes.php --wiki=bnwikisource --fix --add-prefix=T210472 [13:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:32] Urbanecm: the links remain though? 3 resolvable? is that fine? [13:57:01] yeah [13:57:06] cool, all done then! 
[13:57:21] thanks [13:59:10] (03PS2) 10Elukey: druid: create request/access logs for broker/historical [puppet] - 10https://gerrit.wikimedia.org/r/477529 [13:59:29] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477542 (https://phabricator.wikimedia.org/T86338) (owner: 10Marostegui) [14:02:04] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/13829/druid1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/477529 (owner: 10Elukey) [14:03:50] Urbanecm: sorry, traveling all week, not around for swat [14:04:05] ok, thanks for the info zeljkof [14:04:30] !log upgrade turnilo on analytics-tools1002 to nodejs-10 - T210705 [14:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:33] T210705: Move turnilo to nodejs 10 - https://phabricator.wikimedia.org/T210705 [14:05:06] ah no snap the deps, sigh going to scratch that [14:06:54] (03PS1) 10CDanis: On wikimedia-stretch, add repository thirdparty/grafana [puppet] - 10https://gerrit.wikimedia.org/r/477546 (https://phabricator.wikimedia.org/T210416) [14:08:46] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/477546 (https://phabricator.wikimedia.org/T210416) (owner: 10CDanis) [14:12:43] (03CR) 10CDanis: "Verified no diffs to existing grafana users with puppet-compiler" [puppet] - 10https://gerrit.wikimedia.org/r/477546 (https://phabricator.wikimedia.org/T210416) (owner: 10CDanis) [14:12:49] (03CR) 10CDanis: [C: 032] On wikimedia-stretch, add repository thirdparty/grafana [puppet] - 10https://gerrit.wikimedia.org/r/477546 (https://phabricator.wikimedia.org/T210416) (owner: 10CDanis) [14:16:02] (03PS1) 10Mathew.onipe: admin: add wmde-fisch to deployment [puppet] - 10https://gerrit.wikimedia.org/r/477548 (https://phabricator.wikimedia.org/T211014) [14:16:24] (03PS3) 10Gehel: spicerack: add dateutil dependency [puppet] - 10https://gerrit.wikimedia.org/r/477281 
(https://phabricator.wikimedia.org/T207919) (owner: 10Mathew.onipe) [14:17:44] (03CR) 10Gehel: [C: 032] spicerack: add dateutil dependency [puppet] - 10https://gerrit.wikimedia.org/r/477281 (https://phabricator.wikimedia.org/T207919) (owner: 10Mathew.onipe) [14:18:28] cdanis: unmerged puppet change from you. can I safely merge? [14:18:48] yes, it is a no-op on existing servers [14:18:59] cdanis: ok, thanks! will merge! [14:19:06] yep sorry for the trouble [14:19:13] no trouble at all! [14:21:52] (03CR) 10DCausse: [C: 031] Enable SPARQL logging to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/477429 (https://phabricator.wikimedia.org/T210044) (owner: 10Smalyshev) [14:22:06] (03PS5) 10Gehel: Enable SPARQL logging to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/477429 (https://phabricator.wikimedia.org/T210044) (owner: 10Smalyshev) [14:24:07] (03CR) 10Gehel: [C: 032] Enable SPARQL logging to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/477429 (https://phabricator.wikimedia.org/T210044) (owner: 10Smalyshev) [14:25:47] 10Operations, 10Citoid, 10Regression, 10VisualEditor (Current work): Citoid not communicating with Zotero translation-server in production at all - https://phabricator.wikimedia.org/T211114 (10Mvolz) p:05Triage>03Unbreak! 
[14:25:51] (03PS1) 10Arturo Borrero Gonzalez: toolforge: grid: introduce systemd service file for sge_qmaster [puppet] - 10https://gerrit.wikimedia.org/r/477554 (https://phabricator.wikimedia.org/T211055) [14:26:51] (03CR) 10jerkins-bot: [V: 04-1] toolforge: grid: introduce systemd service file for sge_qmaster [puppet] - 10https://gerrit.wikimedia.org/r/477554 (https://phabricator.wikimedia.org/T211055) (owner: 10Arturo Borrero Gonzalez) [14:26:54] 10Operations, 10SRE-Access-Requests: Requesting access to `researchers` group for joewalsh - https://phabricator.wikimedia.org/T211115 (10JoeWalsh) [14:27:22] 10Operations, 10Citoid, 10Regression, 10VisualEditor (Current work): Citoid not communicating with Zotero translation-server in production at all - https://phabricator.wikimedia.org/T211114 (10Mvolz) [14:30:46] (03PS2) 10Gehel: admin: add create maps-roots and add onimisionipe(Matt) to it [puppet] - 10https://gerrit.wikimedia.org/r/477294 (https://phabricator.wikimedia.org/T211020) (owner: 10Mathew.onipe) [14:31:29] (03CR) 10Gehel: [C: 032] admin: add create maps-roots and add onimisionipe(Matt) to it [puppet] - 10https://gerrit.wikimedia.org/r/477294 (https://phabricator.wikimedia.org/T211020) (owner: 10Mathew.onipe) [14:31:40] (03PS3) 10Gehel: maps: add maps-roots to maps hieradata [puppet] - 10https://gerrit.wikimedia.org/r/477298 (https://phabricator.wikimedia.org/T211020) (owner: 10Mathew.onipe) [14:31:53] 10Operations, 10Citoid, 10Services (done), 10VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10Mvolz) 05Resolved>03Open [14:32:23] (03CR) 10Gehel: [C: 032] maps: add maps-roots to maps hieradata [puppet] - 10https://gerrit.wikimedia.org/r/477298 (https://phabricator.wikimedia.org/T211020) (owner: 10Mathew.onipe) [14:32:28] 10Operations, 10Citoid, 10Services (done), 10VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - 
https://phabricator.wikimedia.org/T197242 (10Mvolz) Re-opening due to T211114 [14:35:14] (03PS1) 10CDanis: Add role::grafana and switch grafana1001.eqiad to it [puppet] - 10https://gerrit.wikimedia.org/r/477557 (https://phabricator.wikimedia.org/T210416) [14:36:20] (03PS8) 10Filippo Giunchedi: rsyslog: add UDP localhost compatibility endpoint [puppet] - 10https://gerrit.wikimedia.org/r/475352 (https://phabricator.wikimedia.org/T205851) [14:36:22] (03PS3) 10Filippo Giunchedi: logstash: add new logging kafka consumer [puppet] - 10https://gerrit.wikimedia.org/r/476472 (https://phabricator.wikimedia.org/T205851) [14:36:24] (03PS3) 10Filippo Giunchedi: logstash: copy 'severity' into 'level' where needed [puppet] - 10https://gerrit.wikimedia.org/r/476473 (https://phabricator.wikimedia.org/T205851) [14:37:36] (03CR) 10Filippo Giunchedi: rsyslog: add UDP localhost compatibility endpoint (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/475352 (https://phabricator.wikimedia.org/T205851) (owner: 10Filippo Giunchedi) [14:39:28] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Setup elasticsearch on new codfw servers - https://phabricator.wikimedia.org/T210265 (10Mathew.onipe) [14:40:16] 10Operations, 10Maps, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Create maps-root group and add Matt(onimisionipe) to maps-roots - https://phabricator.wikimedia.org/T211020 (10Mathew.onipe) 05Open>03Resolved [14:41:17] (03CR) 10Muehlenhoff: [C: 031] "Looks fine. 
(but touches sudo rules and needs to be acked in next SRE meeting)" [puppet] - 10https://gerrit.wikimedia.org/r/477548 (https://phabricator.wikimedia.org/T211014) (owner: 10Mathew.onipe) [14:44:08] (03CR) 10Gehel: [C: 04-1] "Looks good, but waiting for servers to be racked before merging" [puppet] - 10https://gerrit.wikimedia.org/r/477523 (https://phabricator.wikimedia.org/T210265) (owner: 10Mathew.onipe) [14:47:33] (03CR) 10Ottomata: [C: 031] druid: create request/access logs for broker/historical [puppet] - 10https://gerrit.wikimedia.org/r/477529 (owner: 10Elukey) [14:50:38] (03PS4) 10Banyek: mariadb: materialized view generator for analytics team [puppet] - 10https://gerrit.wikimedia.org/r/477503 (https://phabricator.wikimedia.org/T210693) [14:51:08] (03CR) 10jerkins-bot: [V: 04-1] mariadb: materialized view generator for analytics team [puppet] - 10https://gerrit.wikimedia.org/r/477503 (https://phabricator.wikimedia.org/T210693) (owner: 10Banyek) [14:52:02] 10Operations, 10DBA, 10Availability (MediaWiki-MultiDC), 10Performance-Team (Radar): Investigate solutions for MySQL connection pooling - https://phabricator.wikimedia.org/T196378 (10Imarlier) @jcrespo Why would we need to deploy Mediawiki in order to repoint when the master is switched? Wouldn't the prox... 
[14:52:44] (03CR) 10Elukey: [C: 032] druid: create request/access logs for broker/historical [puppet] - 10https://gerrit.wikimedia.org/r/477529 (owner: 10Elukey) [14:52:47] (03PS3) 10Elukey: druid: create request/access logs for broker/historical [puppet] - 10https://gerrit.wikimedia.org/r/477529 [14:53:38] (03PS5) 10Banyek: mariadb: materialized view generator for analytics team [puppet] - 10https://gerrit.wikimedia.org/r/477503 (https://phabricator.wikimedia.org/T210693) [14:56:44] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::php: use unix socket everywhere [puppet] - 10https://gerrit.wikimedia.org/r/477561 [14:57:22] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::mediawiki::php: use unix socket everywhere [puppet] - 10https://gerrit.wikimedia.org/r/477561 (owner: 10Giuseppe Lavagetto) [15:01:32] (03PS2) 10CDanis: Add role::grafana and switch grafana1001.eqiad to it [puppet] - 10https://gerrit.wikimedia.org/r/477557 (https://phabricator.wikimedia.org/T210416) [15:02:32] (03PS1) 10Elukey: druid: add request/access log for middlemanager [puppet] - 10https://gerrit.wikimedia.org/r/477562 [15:02:49] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477563 [15:03:33] (03CR) 10Elukey: [C: 032] druid: add request/access log for middlemanager [puppet] - 10https://gerrit.wikimedia.org/r/477562 (owner: 10Elukey) [15:03:46] PROBLEM - PHP7 rendering on mw2144 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.074 second response time [15:04:38] PROBLEM - PHP7 rendering on mw2169 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.073 second response time [15:05:07] _joe_ --^ [15:05:11] temporary? 
[15:05:28] <_joe_> elukey: not sure, checking [15:06:02] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477563 (owner: 10Marostegui) [15:07:06] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477563 (owner: 10Marostegui) [15:08:05] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1103:3314 T86338 T202167 (duration: 00m 47s) [15:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:10] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [15:08:10] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 [15:08:16] <_joe_> elukey: uhm [15:10:33] 10Operations, 10DBA, 10Availability (MediaWiki-MultiDC), 10Performance-Team (Radar): Investigate solutions for MySQL connection pooling - https://phabricator.wikimedia.org/T196378 (10Marostegui) >>! In T196378#4797793, @Imarlier wrote: > @jcrespo Why would we need to deploy Mediawiki in order to repoint wh... 
[15:11:13] <_joe_> elukey: that is quite absurd [15:11:45] <_joe_> looks like those two machines specifically ran puppet but somehow didn't get the update of the php-fpm code [15:12:02] <_joe_> this looks like a quite weird race condition [15:12:18] RECOVERY - PHP7 rendering on mw2144 is OK: HTTP OK: HTTP/1.1 200 OK - 74420 bytes in 1.294 second response time [15:12:38] <_joe_> I'm just running puppet on those servers [15:12:47] 10Operations: Usual git mechanism for aborting commit does not work on the private puppet repo - https://phabricator.wikimedia.org/T211121 (10CDanis) [15:12:55] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Onboard at least 10 new non-sensitive log producers to the logging pipeline - https://phabricator.wikimedia.org/T205852 (10herron) [15:12:57] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Ship PuppetDB logs to ELK - https://phabricator.wikimedia.org/T210458 (10herron) 05Open>03Resolved PuppetDB logs have been flowing into logstash for a couple days now. 
Resolving [15:13:03] _joe_ weird indeed [15:13:49] (03PS1) 10Elukey: druid: create access log only for broker and historical [puppet] - 10https://gerrit.wikimedia.org/r/477564 [15:14:57] (03PS2) 10Elukey: druid: create access log only for broker and historical [puppet] - 10https://gerrit.wikimedia.org/r/477564 [15:15:41] (03CR) 10Elukey: [C: 032] druid: create access log only for broker and historical [puppet] - 10https://gerrit.wikimedia.org/r/477564 (owner: 10Elukey) [15:16:44] RECOVERY - PHP7 rendering on mw2169 is OK: HTTP OK: HTTP/1.1 200 OK - 74418 bytes in 0.315 second response time [15:16:55] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477563 (owner: 10Marostegui) [15:18:22] (03PS1) 10Volans: README: update API documentation [cookbooks] - 10https://gerrit.wikimedia.org/r/477565 (https://phabricator.wikimedia.org/T199079) [15:18:56] (03PS4) 10Volans: cookbook: split main into argument_parser and run [software/spicerack] - 10https://gerrit.wikimedia.org/r/458115 (https://phabricator.wikimedia.org/T199079) [15:19:41] (03CR) 10Volans: "replies inline" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/458115 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:25:10] (03CR) 10Marostegui: mariadb: materialized view generator for analytics team (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/477503 (https://phabricator.wikimedia.org/T210693) (owner: 10Banyek) [15:29:38] (03PS1) 10Alexandros Kosiaris: scb: Fix zotero config typo [puppet] - 10https://gerrit.wikimedia.org/r/477566 (https://phabricator.wikimedia.org/T197242) [15:30:08] (03CR) 10Alexandros Kosiaris: [C: 032] scb: Fix zotero config typo [puppet] - 10https://gerrit.wikimedia.org/r/477566 (https://phabricator.wikimedia.org/T197242) (owner: 10Alexandros Kosiaris) [15:34:10] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} 
(Get citation for Darth Vader) timed out before a response was received [15:35:28] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:35:44] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:35:48] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:35:58] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:36:12] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:36:36] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:36:36] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:36:36] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:36:52] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:36:56] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: 
/en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:37:00] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:37:02] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:37:04] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:37:10] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:37:22] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:37:24] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:37:40] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:37:50] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:37:50] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:37:52] PROBLEM - Restbase edge ulsfo on 
text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:38:00] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:38:10] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:38:10] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:38:14] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:38:14] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:38:14] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:38:15] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:38:16] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:38:18] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:38:34] PROBLEM - restbase 
endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:38:35] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:38:36] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:38:36] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:38:46] 10Operations, 10Citoid, 10Regression, 10VisualEditor (Current work): Citoid not communicating with Zotero translation-server in production at all - https://phabricator.wikimedia.org/T211114 (10Jdforrester-WMF) Looks fixed now? Though the date format returned is not ideal (see https://en.wikipedia.org/api/... 
[15:39:04] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:44:25] (03CR) 10Paladox: "Per my chat with @Dzahn, we decided not to support php5 for php-fpm because the php module does most of our work (makes this easier) but o" [puppet] - 10https://gerrit.wikimedia.org/r/476985 (owner: 10Paladox) [15:46:40] 10Operations, 10Citoid, 10Regression, 10VisualEditor (Current work): Citoid not communicating with Zotero translation-server in production at all - https://phabricator.wikimedia.org/T211114 (10Mvolz) Yup looks fixed in the last hour :) [15:47:07] 10Operations, 10Citoid, 10Regression, 10VisualEditor (Current work): Citoid not communicating with Zotero translation-server in production at all - https://phabricator.wikimedia.org/T211114 (10Mvolz) 05Open>03Resolved [15:47:10] 10Operations, 10Citoid, 10Patch-For-Review, 10Services (done), 10VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10Mvolz) [15:47:14] 10Operations, 10Domains, 10Traffic: SOA serial numbers returned by authoritative nameservers differ - https://phabricator.wikimedia.org/T206688 (10Dzahn) 05Resolved>03Open They mailed again with the same stuff as before.. wikimedia.is isn't compliant because the SOAs differ etc.. Then MarkMonitor mailed... [15:48:22] 10Operations, 10Citoid, 10Regression, 10VisualEditor (Current work): Citoid not communicating with Zotero translation-server in production at all - https://phabricator.wikimedia.org/T211114 (10Mvolz) Hmm but the qid one is still not working... 
[15:50:14] 10Operations, 10Citoid, 10Patch-For-Review, 10Services (done), 10VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10Mvolz) [15:50:19] 10Operations, 10Citoid, 10Regression, 10VisualEditor (Current work): Some regressions in production with Zotero translation-server in production at all - https://phabricator.wikimedia.org/T211114 (10Mvolz) 05Resolved>03Open p:05Unbreak!>03High [15:51:42] 10Operations, 10Citoid, 10Regression, 10VisualEditor (Current work): Some regressions in production with Zotero translation-server in production at all - https://phabricator.wikimedia.org/T211114 (10Mvolz) p:05High>03Normal [15:54:56] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy [15:55:06] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy [15:55:06] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [15:55:08] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy [15:55:12] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [15:55:12] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy [15:55:12] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [15:55:30] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy [15:55:30] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [15:55:34] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [15:55:34] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy [15:55:58] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [15:55:58] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [15:56:00] 
RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy [15:56:00] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [15:56:00] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy [15:56:34] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy [15:57:08] 10Operations, 10Domains, 10Traffic: SOA serial numbers returned by authoritative nameservers differ - https://phabricator.wikimedia.org/T206688 (10BBlack) 05Open>03Resolved Fixed again. Copying my whole terminal output for posterity. This runs a readonly command that `md5sum`'s the zones directory to c... [15:57:40] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [15:57:40] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [15:57:50] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [15:57:56] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [15:57:56] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy [15:58:02] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [15:58:16] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [15:58:26] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [15:58:27] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [15:58:30] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [15:58:34] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [15:58:46] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy [15:58:53] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy 
[15:58:53] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [15:58:54] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [15:58:54] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [15:59:03] what was all this restbase noise? [15:59:25] ~15:34 -> ~15:58 [15:59:56] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy [16:00:00] (03CR) 10CRusnov: [C: 032] README: update API documentation [cookbooks] - 10https://gerrit.wikimedia.org/r/477565 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [16:00:06] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy [16:00:11] bblack: The new citoid service broke the monitoring, so it got switched off. [16:00:20] (The monitoring, not the service.) [16:00:20] (03CR) 10CRusnov: [C: 032] "> Patch Set 1: Code-Review+2" [cookbooks] - 10https://gerrit.wikimedia.org/r/477565 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [16:00:38] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:01:22] (03PS1) 10Ema: cache_upload: hfp on frontends for large objects except for exp [puppet] - 10https://gerrit.wikimedia.org/r/477573 (https://phabricator.wikimedia.org/T144187) [16:01:32] (03PS1) 10Ema: cache: stop using nhw admission policy [puppet] - 10https://gerrit.wikimedia.org/r/477574 (https://phabricator.wikimedia.org/T144187) [16:04:12] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:05:47] (03PS2) 10Ema: cache: stop using nhw admission policy [puppet] - 10https://gerrit.wikimedia.org/r/477574 (https://phabricator.wikimedia.org/T144187) [16:06:56] (03CR) 10CRusnov: [C: 031] "Looks good to me. No major linguistic issues detected." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/477565 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [16:07:14] (03PS2) 10Volans: README: update API documentation [cookbooks] - 10https://gerrit.wikimedia.org/r/477565 (https://phabricator.wikimedia.org/T199079) [16:12:43] 10Operations, 10MediaWiki-Logging, 10Wikimedia-Logstash: Move mediawiki to new logging infrastructure - https://phabricator.wikimedia.org/T211124 (10fgiunchedi) p:05Triage>03Normal [16:18:14] 10Operations, 10MediaWiki-Logging, 10Wikimedia-Logstash: Move mediawiki to new logging infrastructure - https://phabricator.wikimedia.org/T211124 (10fgiunchedi) I've looked briefly at how to implement prefixing syslog json messages with `@cee:` and I'd say we could do it on the "syslog side" i.e. `./includes... [16:20:44] (03CR) 10Bstorm: [C: 031] "Step in the right direction." [puppet] - 10https://gerrit.wikimedia.org/r/477531 (https://phabricator.wikimedia.org/T211055) (owner: 10Arturo Borrero Gonzalez) [16:21:04] 10Operations, 10Beta-Cluster-Infrastructure: "Obama" page on Beta Cluster often responds with 503 - https://phabricator.wikimedia.org/T188913 (10Niedzielski) [16:22:12] (03PS1) 10CDanis: Copy necessary hieradata from role::webserver_misc_apps to role::grafana [labs/private] - 10https://gerrit.wikimedia.org/r/477579 (https://phabricator.wikimedia.org/T210416) [16:22:20] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [16:22:56] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [16:23:18] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [16:23:18] PROBLEM - restbase 
endpoints health on restbase1008 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [16:23:20] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Scrapes sample page) timed out before a response was received [16:23:23] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [16:23:36] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [16:23:42] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Scrapes sample page) timed out before a response was received [16:23:54] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [16:24:02] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [16:24:02] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [16:24:08] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [16:24:14] (03PS3) 10Arturo Borrero Gonzalez: toolforge: grid: base: extract variables into hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/477531 (https://phabricator.wikimedia.org/T211055) [16:24:24] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [16:24:30] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [16:24:37] 10Operations, 10Beta-Cluster-Infrastructure: "Obama" 
page on Beta Cluster often responds with 500 or 503 - https://phabricator.wikimedia.org/T188913 (10Niedzielski) [16:25:02] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [16:25:08] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [16:25:08] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [16:25:11] 10Operations, 10Beta-Cluster-Infrastructure: "Obama" page on Beta Cluster often responds with 500 or 503 - https://phabricator.wikimedia.org/T188913 (10Niedzielski) [16:25:31] (03PS1) 10Anomie: Configure 'api-warning' log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477581 [16:25:40] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [16:25:42] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [16:25:54] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [16:25:57] 10Operations, 10Services, 10Wikimedia-Logstash, 10service-runner: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10fgiunchedi) p:05Triage>03Normal [16:26:00] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy [16:26:00] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [16:26:36] (03CR) 10CDanis: [V: 032 C: 032] Copy necessary hieradata from role::webserver_misc_apps to role::grafana [labs/private] - 10https://gerrit.wikimedia.org/r/477579 (https://phabricator.wikimedia.org/T210416) (owner: 10CDanis) [16:27:30] (03CR) 10Arturo Borrero Gonzalez: [C: 032] toolforge: grid: base: extract variables into hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/477531 (https://phabricator.wikimedia.org/T211055) (owner: 10Arturo Borrero Gonzalez) [16:30:05] 10Operations, 10MediaWiki-Logging, 10Wikimedia-Logstash: Move mediawiki to new logging infrastructure - 
https://phabricator.wikimedia.org/T211124 (10fgiunchedi) [16:32:11] 10Operations, 10Citoid, 10Regression, 10VisualEditor (Current work): Some regressions in production with Zotero translation-server in production at all - https://phabricator.wikimedia.org/T211114 (10Mvolz) >>! In T211114#4797913, @Jdforrester-WMF wrote: > Looks fixed now? > > Though the date format return... [16:32:24] 10Operations, 10ops-codfw, 10Core Platform Team, 10Services (doing), and 2 others: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843 (10fgiunchedi) [16:33:13] (03PS3) 10CDanis: Add role::grafana and switch grafana1001.eqiad to it [puppet] - 10https://gerrit.wikimedia.org/r/477557 (https://phabricator.wikimedia.org/T210416) [16:35:41] 10Operations, 10Citoid, 10Regression, 10VisualEditor (Current work): The new translation-server returns access date with the full time stamp; we should strip this - https://phabricator.wikimedia.org/T211127 (10Mvolz) p:05Triage>03Normal [16:37:59] (03PS4) 10CDanis: Add role::grafana and switch grafana1001.eqiad to it [puppet] - 10https://gerrit.wikimedia.org/r/477557 (https://phabricator.wikimedia.org/T210416) [16:38:01] (03PS3) 10Daimona Eaytoy: Move all AbuseFilter config to abusefilter.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477063 (https://phabricator.wikimedia.org/T145931) [16:38:08] (03PS5) 10Daimona Eaytoy: Update AbuseFilter config to keep the status quo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475772 [16:38:21] (03PS3) 10Cwhite: Add prometheus cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/477366 (https://phabricator.wikimedia.org/T210486) [16:38:40] (03CR) 10CDanis: "Puppet-compiler diffs as expected" [puppet] - 10https://gerrit.wikimedia.org/r/477557 (https://phabricator.wikimedia.org/T210416) (owner: 10CDanis) [16:38:59] (03CR) 10jerkins-bot: [V: 04-1] Update AbuseFilter config to keep the status quo [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/475772 (owner: 10Daimona Eaytoy) [16:43:24] 10Operations, 10Citoid, 10Regression, 10VisualEditor (Current work): The new translation-server returns access date with the full time stamp; we should strip this - https://phabricator.wikimedia.org/T211127 (10Mvolz) @Jdforrester-WMF Unfortunately I'm going to be on an aeroplane both today and tomorrow and... [16:43:38] 10Operations, 10Citoid, 10Regression, 10VisualEditor (Current work): The new translation-server returns access date with the full time stamp; we should strip this - https://phabricator.wikimedia.org/T211127 (10Mvolz) p:05Normal>03Unbreak! [16:44:15] (03CR) 10Filippo Giunchedi: [C: 032] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/477557 (https://phabricator.wikimedia.org/T210416) (owner: 10CDanis) [16:44:28] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [16:45:13] (03CR) 10BPirkle: [C: 031] "Looks good to deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477581 (owner: 10Anomie) [16:45:36] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [16:45:58] (03PS1) 10Fsero: (WIP) local puppet compiler docker-compose [puppet] - 10https://gerrit.wikimedia.org/r/477583 [16:47:04] (03CR) 10Anomie: [C: 032] "Deploying config change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477581 (owner: 10Anomie) [16:47:46] (03Merged) 10jenkins-bot: Configure 'api-warning' log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477581 (owner: 10Anomie) [16:47:49] (03CR) 10Fsero: [V: 04-1 C: 04-2] "DO NOT merge" [puppet] - 10https://gerrit.wikimedia.org/r/477583 (owner: 10Fsero) [16:48:01] (03CR) 10jenkins-bot: Configure 'api-warning' log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477581 (owner: 10Anomie) [16:48:54] !log anomie@deploy1001 Synchronized 
wmf-config/InitialiseSettings.php: Configure 'api-warning' log channel (duration: 00m 47s) [16:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:38] is there any way to recover the file on https://commons.wikimedia.org/wiki/Special:UploadStash ? [16:54:10] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "See a few comments inline; overall the patch seems to go in the right direction." (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/476985 (owner: 10Paladox) [16:54:15] it is a 450 MB file, so if I can avoid to reupload, it would be nice ;oS [16:55:05] there is a publish button, but I get an error [16:55:36] (03PS1) 10Elukey: druid: absent some crons to purge logs [puppet] - 10https://gerrit.wikimedia.org/r/477586 [16:56:08] (03CR) 10jerkins-bot: [V: 04-1] druid: absent some crons to purge logs [puppet] - 10https://gerrit.wikimedia.org/r/477586 (owner: 10Elukey) [16:56:50] 10Operations, 10Citoid, 10Regression, 10VisualEditor (Current work): Some regressions in production with Zotero translation-server in production at all - https://phabricator.wikimedia.org/T211114 (10mobrovac) 05Open>03Resolved a:03akosiaris There was a communication problem both between Citoid and Zo... 
[16:56:53] 10Operations, 10Citoid, 10Patch-For-Review, 10Services (done), 10VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10mobrovac) [16:57:35] 10Operations, 10Wikimedia-Logstash, 10service-runner, 10Core Platform Team Backlog (Next), 10Services (next): Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10mobrovac) [16:59:30] (03PS1) 10Paladox: httpd: Add php_version variable to httpd::mpm [puppet] - 10https://gerrit.wikimedia.org/r/477587 [16:59:44] (03PS1) 10Banyek: mariadb: depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477588 (https://phabricator.wikimedia.org/T85757) [16:59:55] (03PS2) 10Elukey: druid: absent some crons to purge logs [puppet] - 10https://gerrit.wikimedia.org/r/477586 [17:00:04] godog and _joe_: Dear deployers, time to do the Puppet SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181204T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:00:28] (03PS2) 10Paladox: httpd: Add php_version variable to httpd::mpm [puppet] - 10https://gerrit.wikimedia.org/r/477587 [17:01:36] (03CR) 10Elukey: [C: 032] druid: absent some crons to purge logs [puppet] - 10https://gerrit.wikimedia.org/r/477586 (owner: 10Elukey) [17:01:42] (03PS1) 10Banyek: mariadb: depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477589 (https://phabricator.wikimedia.org/T85757) [17:02:00] (03PS1) 10Banyek: mariadb: depool db1097:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477590 (https://phabricator.wikimedia.org/T85757) [17:02:23] (03CR) 10Dzahn: [C: 031] "ah. you are moving it to a dedicated server. 
lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/477557 (https://phabricator.wikimedia.org/T210416) (owner: 10CDanis) [17:02:27] (03PS1) 10Banyek: mariadb: depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477591 (https://phabricator.wikimedia.org/T85757) [17:02:35] <_joe_> paladox: just slap both versions in the title, no need for an added parameter [17:02:47] (03PS1) 10Banyek: mariadb: depool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477592 (https://phabricator.wikimedia.org/T85757) [17:03:07] (03PS1) 10Banyek: mariadb: depool db1113:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477593 (https://phabricator.wikimedia.org/T85757) [17:03:13] _joe_ title? [17:03:26] <_joe_> httpd::mod_conf { ['php5', 'php7.0', 'php7.2']: ensure => absent } [17:03:33] <_joe_> or something along those lines [17:03:41] ah [17:03:42] i see [17:03:44] thanks! [17:03:56] <_joe_> it's simpler :) [17:04:28] (03CR) 10jerkins-bot: [V: 04-1] mariadb: depool db1113:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477593 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [17:05:01] (03PS3) 10Paladox: httpd::mpm: Add php7.0 and php7.2 [puppet] - 10https://gerrit.wikimedia.org/r/477587 [17:05:37] (03PS2) 10Dzahn: wikistats: fix xml dump cron jobs by specifying defaults-extra-file [puppet] - 10https://gerrit.wikimedia.org/r/477451 (https://phabricator.wikimedia.org/T200447) [17:05:49] (03CR) 10Dzahn: [C: 032] wikistats: fix xml dump cron jobs by specifying defaults-extra-file [puppet] - 10https://gerrit.wikimedia.org/r/477451 (https://phabricator.wikimedia.org/T200447) (owner: 10Dzahn) [17:07:00] _joe_ https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/476985/23/modules/profile/manifests/phabricator/main.pp (for the 10secs thing) which do you recommend? 
:) [17:07:43] (03PS2) 10Banyek: mariadb: depool db1113:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477593 (https://phabricator.wikimedia.org/T85757) [17:10:28] (03PS24) 10Paladox: phabricator: Add support for php-fpm in stretch [puppet] - 10https://gerrit.wikimedia.org/r/476985 [17:10:30] (03CR) 10Paladox: phabricator: Add support for php-fpm in stretch (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/476985 (owner: 10Paladox) [17:11:01] (03PS4) 10Cwhite: Add prometheus cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/477366 (https://phabricator.wikimedia.org/T210486) [17:11:12] (03CR) 10jerkins-bot: [V: 04-1] phabricator: Add support for php-fpm in stretch [puppet] - 10https://gerrit.wikimedia.org/r/476985 (owner: 10Paladox) [17:12:28] (03PS25) 10Paladox: phabricator: Add support for php-fpm in stretch [puppet] - 10https://gerrit.wikimedia.org/r/476985 [17:13:22] (03PS26) 10Paladox: phabricator: Add support for php-fpm in stretch [puppet] - 10https://gerrit.wikimedia.org/r/476985 [17:13:35] (03CR) 10Filippo Giunchedi: [C: 031] Add prometheus cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/477366 (https://phabricator.wikimedia.org/T210486) (owner: 10Cwhite) [17:16:10] !log bootstrap cassandra-c on restbase2015 - T210843 [17:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:14] T210843: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843 [17:22:19] !log created oathauth tables on punjabiwikimedia T211110 [17:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:22] T211110: Logging in on punjabiwikimedia throws fatal error - https://phabricator.wikimedia.org/T211110 [17:23:52] RECOVERY - Disk space on notebook1004 is OK: DISK OK [17:27:28] (03CR) 10Marostegui: [C: 04-1] mariadb: depool db1082 (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477588 
(https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [17:28:04] (03CR) 10Marostegui: mariadb: depool db1113:3315 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477593 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [17:29:03] (03CR) 10Marostegui: [C: 04-1] mariadb: depool db1110 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477592 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [17:29:24] (03CR) 10Marostegui: [C: 031] mariadb: depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477591 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [17:29:43] (03CR) 10Marostegui: [C: 031] mariadb: depool db1097:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477590 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [17:29:59] (03CR) 10Marostegui: [C: 031] mariadb: depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477589 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [17:33:59] (03PS1) 10Paladox: phabricator: Increase `max_execution_time` to 30 [puppet] - 10https://gerrit.wikimedia.org/r/477595 [17:34:20] (03PS2) 10Paladox: phabricator: Increase `max_execution_time` to 30 [puppet] - 10https://gerrit.wikimedia.org/r/477595 [17:34:51] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/476985 (owner: 10Paladox) [17:35:18] (03CR) 10jerkins-bot: [V: 04-1] phabricator: Increase `max_execution_time` to 30 [puppet] - 10https://gerrit.wikimedia.org/r/477595 (owner: 10Paladox) [17:36:31] (03PS3) 10Paladox: phabricator: Increase 'max_execution_time' to 30 [puppet] - 10https://gerrit.wikimedia.org/r/477595 [17:37:29] (03CR) 10jerkins-bot: [V: 04-1] phabricator: Increase 'max_execution_time' to 30 [puppet] - 10https://gerrit.wikimedia.org/r/477595 (owner: 10Paladox) [17:38:06] (03PS4) 10Paladox: phabricator: Increase 'max_execution_time' to 30 [puppet] - 10https://gerrit.wikimedia.org/r/477595 
[17:38:20] (03PS5) 10Paladox: phabricator: Increase 'max_execution_time' to 30 [puppet] - 10https://gerrit.wikimedia.org/r/477595 [17:39:43] 10Operations, 10Wikimedia-IRC-RC-Server, 10Patch-For-Review: Replace ircd-ratbox with something newer/maintained - https://phabricator.wikimedia.org/T134271 (10Dzahn) 05Open>03stalled [17:40:39] 10Operations, 10Patch-For-Review: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562 (10Dzahn) [17:42:52] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [17:43:06] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [17:43:32] 10Operations, 10Traffic: DNS recursors TCP retransmits - https://phabricator.wikimedia.org/T211131 (10ayounsi) p:05Triage>03Normal [17:43:58] 10Operations, 10Patch-For-Review: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562 (10Dzahn) 05Open>03Resolved all subtasks here are resolved, the specific ones mentioned were rdb1005/1006 (closed) and lvs2009/2010 (setup in progress). We still want T206131 to add monito... 
[17:44:14] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [17:44:36] 10Operations, 10Traffic: DNS recursors TCP retransmits - https://phabricator.wikimedia.org/T211131 (10ayounsi) [17:45:12] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy [17:45:23] !log ppchelko@deploy1001 Started deploy [changeprop/deploy@e1aeb27]: Do not initialize scores and errors arrays in advance T210465 [17:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:26] T210465: Refinery Spark HiveExtensions schema merge should support merging of arrays with struct elements - https://phabricator.wikimedia.org/T210465 [17:45:44] PROBLEM - MariaDB Slave Lag: s5 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.34 seconds [17:46:02] (03PS1) 10Andrew Bogott: Horizon: cleaned up rules for a bunch of deleted projects. [puppet] - 10https://gerrit.wikimedia.org/r/477597 [17:46:36] !log ppchelko@deploy1001 Finished deploy [changeprop/deploy@e1aeb27]: Do not initialize scores and errors arrays in advance T210465 (duration: 01m 13s) [17:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:13] !log joal@deploy1001 Started deploy [analytics/aqs/deploy@e7d48e9]: Add underestimate and offset to uniques-devices endpoint [17:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:30] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [17:52:34] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [17:52:50] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation 
for Darth Vader) timed out before a response was received [17:53:20] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (Scrapes sample page) timed out before a response was received [17:53:36] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [17:53:40] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy [17:53:58] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy [17:54:08] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [17:55:18] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [17:55:20] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [17:55:26] (03PS2) 10Andrew Bogott: Horizon: cleaned up rules for a bunch of deleted projects [puppet] - 10https://gerrit.wikimedia.org/r/477597 [17:55:28] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received [17:56:10] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [17:56:50] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [17:57:14] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [17:57:36] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [17:57:38] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy [17:57:38] RECOVERY - 
restbase endpoints health on restbase2008 is OK: All endpoints are healthy [17:57:44] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy [17:59:17] (03CR) 10Andrew Bogott: [C: 032] Horizon: cleaned up rules for a bunch of deleted projects [puppet] - 10https://gerrit.wikimedia.org/r/477597 (owner: 10Andrew Bogott) [17:59:22] akosiaris: this seems to be zotero v2 struggling ^ [18:00:04] cscott, arlolra, subbu, halfak, and Amir1: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / Parsoid / Citoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181204T1800). [18:00:26] (03PS1) 10Dzahn: remove graphite2001/2002 from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/477599 (https://phabricator.wikimedia.org/T199321) [18:00:27] No deploy [18:00:31] (for ORES) [18:01:05] (03CR) 10Cwhite: [C: 032] Add prometheus cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/477366 (https://phabricator.wikimedia.org/T210486) (owner: 10Cwhite) [18:01:13] (03PS5) 10Cwhite: Add prometheus cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/477366 (https://phabricator.wikimedia.org/T210486) [18:01:16] (03CR) 10Dzahn: [C: 032] "decom. 
if these come back then with different names and not using jessie..so going ahead here" [puppet] - 10https://gerrit.wikimedia.org/r/477599 (https://phabricator.wikimedia.org/T199321) (owner: 10Dzahn) [18:01:30] (03PS2) 10Dzahn: remove graphite2001/2002 from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/477599 (https://phabricator.wikimedia.org/T199321) [18:02:31] mobrovac: indeed I 'll increase the number of pods [18:03:24] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [18:03:35] 10Operations, 10hardware-requests: eqiad: (1) hardware access request for Analytics Cloudb replica - https://phabricator.wikimedia.org/T211135 (10elukey) [18:03:58] !log bump zotero pod number from 4 to 16 in eqiad/codfw [18:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:34] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy [18:04:46] !log joal@deploy1001 Finished deploy [analytics/aqs/deploy@e7d48e9]: Add underestimate and offset to uniques-devices endpoint (duration: 17m 33s) [18:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:07] (03PS3) 10Dzahn: remove graphite2001/2002 from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/477599 (https://phabricator.wikimedia.org/T199321) [18:11:04] 10Operations, 10decommission, 10hardware-requests, 10Patch-For-Review, 10User-fgiunchedi: Return graphite200[12] to spares pool - https://phabricator.wikimedia.org/T199321 (10Dzahn) [18:11:52] 10Operations, 10decommission, 10hardware-requests, 10Patch-For-Review, 10User-fgiunchedi: Return graphite200[12] to spares pool - https://phabricator.wikimedia.org/T199321 (10Dzahn) [18:12:23] (03PS2) 10Cwhite: Add graphite cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/477367 (https://phabricator.wikimedia.org/T210486) [18:12:47] 
10Operations, 10decommission, 10hardware-requests, 10User-fgiunchedi: Return graphite200[12] to spares pool - https://phabricator.wikimedia.org/T199321 (10Dzahn) [18:13:03] (03CR) 10Cwhite: [C: 032] Add graphite cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/477367 (https://phabricator.wikimedia.org/T210486) (owner: 10Cwhite) [18:13:43] 10Operations, 10decommission, 10hardware-requests, 10Patch-For-Review, 10User-fgiunchedi: Return graphite100[13] to spares pool (or decom) - https://phabricator.wikimedia.org/T209357 (10Dzahn) [18:15:23] RECOVERY - MariaDB Slave Lag: s5 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 48.39 seconds [18:16:12] (03PS1) 10Dzahn: remove graphite1001, graphite1003 from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/477601 (https://phabricator.wikimedia.org/T209357) [18:18:28] (03PS1) 10Dzahn: switch graphite host for dev_cluster from graphite1003 to graphite1004 [puppet] - 10https://gerrit.wikimedia.org/r/477602 (https://phabricator.wikimedia.org/T209357) [18:20:54] akosiaris: i think that did the trick! citoid doesn't seem to be flapping any more [18:20:55] thnx! 
[18:21:01] (03PS1) 10Dzahn: switch graphite host for prod cassandra to graphite1004 [puppet] - 10https://gerrit.wikimedia.org/r/477604 (https://phabricator.wikimedia.org/T209357) [18:21:11] (03PS1) 10Ottomata: Re-enable revision-score refinement (again) [puppet] - 10https://gerrit.wikimedia.org/r/477605 (https://phabricator.wikimedia.org/T210465) [18:21:59] (03CR) 10Ottomata: [C: 032] Re-enable revision-score refinement (again) [puppet] - 10https://gerrit.wikimedia.org/r/477605 (https://phabricator.wikimedia.org/T210465) (owner: 10Ottomata) [18:22:41] 10Operations, 10Pybal, 10Traffic: DNS recursors TCP retransmits - https://phabricator.wikimedia.org/T211131 (10ayounsi) p:05Normal>03Low Doing a DNS query over TCP from bast2001 to dns2001 (directly) or dns2002 (via the LVS VIP `dig @dns-rec-lb.codfw.wikimedia.org en.wikipedia.org +tcp`) doesn't show any... [18:23:03] 10Operations, 10decommission, 10hardware-requests, 10Patch-For-Review, 10User-fgiunchedi: Return graphite100[13] to spares pool (or decom) - https://phabricator.wikimedia.org/T209357 (10Dzahn) we should also switch the state of these spare graphite hosts in netbox [18:24:55] 10Operations, 10monitoring: Add monitoring for nutcracker - https://phabricator.wikimedia.org/T95231 (10Dzahn) [18:28:12] 10Operations, 10Pybal, 10Traffic: DNS recursors TCP retransmits - https://phabricator.wikimedia.org/T211131 (10ayounsi) [18:39:32] 10Operations, 10DBA, 10Gerrit: Convert Gerrit's to use H2 as the database after 2.16 upgrade - https://phabricator.wikimedia.org/T211139 (10Paladox) [18:40:03] (03PS1) 10Arturo Borrero Gonzalez: hieradata: delete toolforge-specific keys [puppet] - 10https://gerrit.wikimedia.org/r/477607 [18:41:01] (03CR) 10Arturo Borrero Gonzalez: [C: 032] hieradata: delete toolforge-specific keys [puppet] - 10https://gerrit.wikimedia.org/r/477607 (owner: 10Arturo Borrero Gonzalez) [18:41:12] 10Operations, 10DBA, 10Gerrit: Convert Gerrit to use H2 as the database after 2.16 upgrade - 
https://phabricator.wikimedia.org/T211139 (10Paladox) [18:41:20] 10Operations, 10DBA, 10Gerrit: Convert Gerrit to use H2 as the database after 2.16 upgrade - https://phabricator.wikimedia.org/T211139 (10Paladox) [18:42:31] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Onboard at least 10 new non-sensitive log producers to the logging pipeline - https://phabricator.wikimedia.org/T205852 (10Paladox) [18:42:50] thanks paladox! [18:42:57] your welcome! :) [18:48:27] (03PS2) 10Arturo Borrero Gonzalez: toolforge: grid: introduce systemd service file for sge_qmaster [puppet] - 10https://gerrit.wikimedia.org/r/477554 (https://phabricator.wikimedia.org/T211055) [18:48:53] 10Operations, 10Gerrit: Convert Gerrit to use H2 as the database after 2.16 upgrade - https://phabricator.wikimedia.org/T211139 (10Marostegui) [18:50:14] (03CR) 10Arturo Borrero Gonzalez: [C: 032] toolforge: grid: introduce systemd service file for sge_qmaster [puppet] - 10https://gerrit.wikimedia.org/r/477554 (https://phabricator.wikimedia.org/T211055) (owner: 10Arturo Borrero Gonzalez) [18:51:27] godog: do you have examples to point to about how the UDP packet payload should look? I'm not finding a really clear example of CEE cookie usage that references an RFC. [18:56:09] 10Operations, 10MediaWiki-Logging, 10Wikimedia-Logstash: Move mediawiki to new logging infrastructure - https://phabricator.wikimedia.org/T211124 (10bd808) >>! In T211124#4798013, @fgiunchedi wrote: > I've looked briefly at how to implement prefixing syslog json messages with `@cee:` and I'd say we could do... 
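On the question above about what a `@cee:`-prefixed UDP syslog payload looks like, a minimal sketch follows. It assumes an RFC 3164-style header (`<PRI>TIMESTAMP HOST TAG:`) and the conventional `@cee:` cookie placed directly before a JSON object. The hostname, tag, port, and JSON field names are hypothetical and are not taken from the task or from the channel.

```python
import json
import socket

CEE_COOKIE = "@cee:"  # marker telling the receiver that the rest of the message is JSON


def build_cee_payload(fields, facility=16, severity=6,
                      timestamp="Dec  4 18:51:27",
                      hostname="mw-example", tag="mediawiki"):
    """Return one syslog datagram shaped as: <PRI>TIMESTAMP HOST TAG: @cee: {json}.

    hostname and tag are made-up placeholders; facility 16 is local0 and
    severity 6 is informational.
    """
    pri = facility * 8 + severity  # syslog priority value
    body = json.dumps(fields, sort_keys=True)
    line = f"<{pri}>{timestamp} {hostname} {tag}: {CEE_COOKIE} {body}"
    return line.encode("utf-8")


def extract_cee_json(payload):
    """Return the JSON object that follows the cookie, or None if there is no cookie."""
    text = payload.decode("utf-8")
    _, cookie, rest = text.partition(CEE_COOKIE)
    if not cookie:
        return None
    return json.loads(rest)


def send_udp(payload, host="127.0.0.1", port=10514):
    """Send one datagram; host and port are hypothetical."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

A receiver that understands the cookie would parse everything after `@cee:` as structured fields, while one that does not would still see an ordinary syslog line. Whether a space is required after the cookie depends on the receiving parser, so treat that detail as an assumption.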
[18:57:28] PROBLEM - DPKG on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [18:57:36] PROBLEM - Disk space on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [18:57:48] PROBLEM - Check systemd state on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [18:57:50] PROBLEM - configured eth on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [18:58:18] (03PS2) 10Bstorm: sonofgridengine: remove weird accounting link [puppet] - 10https://gerrit.wikimedia.org/r/477446 (https://phabricator.wikimedia.org/T200557) [18:59:22] PROBLEM - MD RAID on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [19:01:16] PROBLEM - dhclient process on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [19:02:10] PROBLEM - puppet last run on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [19:02:36] (03CR) 10Bstorm: [C: 032] sonofgridengine: remove weird accounting link [puppet] - 10https://gerrit.wikimedia.org/r/477446 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [19:05:48] 10Operations, 10Wikimedia-Logstash: Investigate approaches to ingest sensitive log producers - https://phabricator.wikimedia.org/T205855 (10herron) [19:05:53] 10Operations, 10Gerrit: Convert Gerrit to use H2 as the database after 2.16 upgrade - https://phabricator.wikimedia.org/T211139 (10Dzahn) On one hand i would love this because it would make the gerrit codfw slave work which is blocked to lack of misc mysql cluster there. (because of that the DBA tag wasn't wro... [19:07:34] 10Operations, 10Gerrit: Convert Gerrit to use H2 as the database after 2.16 upgrade - https://phabricator.wikimedia.org/T211139 (10Paladox) Those Docs are based on using H2 fully, by that i mean not using NoteDB. 
In 2.16 it would be a single table called "schema_version" that is used by the gerrit init when do... [19:08:19] 10Operations, 10Wikimedia-Logstash: Investigate approaches to ingest sensitive log producers - https://phabricator.wikimedia.org/T205855 (10herron) [19:09:42] RECOVERY - MD RAID on notebook1004 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [19:10:00] RECOVERY - DPKG on notebook1004 is OK: All packages OK [19:10:08] RECOVERY - Disk space on notebook1004 is OK: DISK OK [19:10:24] RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational [19:10:24] RECOVERY - configured eth on notebook1004 is OK: OK - interfaces up [19:10:32] RECOVERY - dhclient process on notebook1004 is OK: PROCS OK: 0 processes with command name dhclient [19:12:30] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:16:53] 10Operations, 10Wikimedia-Logstash: Investigate approaches to ingest sensitive log producers - https://phabricator.wikimedia.org/T205855 (10herron) [19:19:46] 10Operations, 10Wikimedia-Logstash: Investigate approaches to ingest sensitive log producers - https://phabricator.wikimedia.org/T205855 (10herron) Updated the description to outline several possible approaches. To me option #2 stands out as worthy of a POC. [19:24:53] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@81dac18]: Install new Updater for T210044 investigation [19:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:57] T210044: Data corruption when loading RDF data into WDQS - https://phabricator.wikimedia.org/T210044 [19:31:26] 10Operations, 10Performance-Team, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Errors trying to fetch RDF from Wikidata - https://phabricator.wikimedia.org/T207718 (10Smalyshev) I've reduced the pool lifetime to 1s (which should be essentially as if there was no pooling if I get it correctly), let's... 
[19:32:28] (03PS2) 10Dzahn: remove graphite1001, graphite1003 from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/477601 (https://phabricator.wikimedia.org/T209357) [19:33:51] (03CR) 10Cwhite: [C: 031] remove graphite1001, graphite1003 from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/477601 (https://phabricator.wikimedia.org/T209357) (owner: 10Dzahn) [19:33:54] (03CR) 10Dzahn: [C: 032] "if they get reclaimed they would still have other names and not be on jessie.. so it's an edit one way or another" [puppet] - 10https://gerrit.wikimedia.org/r/477601 (https://phabricator.wikimedia.org/T209357) (owner: 10Dzahn) [19:35:30] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@81dac18]: Install new Updater for T210044 investigation (duration: 10m 36s) [19:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:33] T210044: Data corruption when loading RDF data into WDQS - https://phabricator.wikimedia.org/T210044 [19:37:18] (03PS1) 10Cwhite: wmcs: add prometheus-memcached-exporter [puppet] - 10https://gerrit.wikimedia.org/r/477620 (https://phabricator.wikimedia.org/T147326) [19:38:23] (03PS1) 10Smalyshev: Enable SPARQL logging for internal & production [puppet] - 10https://gerrit.wikimedia.org/r/477621 [19:38:46] (03CR) 10jerkins-bot: [V: 04-1] Enable SPARQL logging for internal & production [puppet] - 10https://gerrit.wikimedia.org/r/477621 (owner: 10Smalyshev) [19:39:26] (03PS1) 10Smalyshev: Disable SPARQL logging [puppet] - 10https://gerrit.wikimedia.org/r/477622 [19:39:53] (03CR) 10jerkins-bot: [V: 04-1] Disable SPARQL logging [puppet] - 10https://gerrit.wikimedia.org/r/477622 (owner: 10Smalyshev) [19:40:45] (03PS2) 10Smalyshev: Enable SPARQL logging for internal & production [puppet] - 10https://gerrit.wikimedia.org/r/477621 [19:40:48] (03PS2) 10Smalyshev: Disable SPARQL logging [puppet] - 10https://gerrit.wikimedia.org/r/477622 [19:45:41] (03PS1) 10Bstorm: wiki replicas: depool labsdb1010 for testing materialized view 
[puppet] - 10https://gerrit.wikimedia.org/r/477624 (https://phabricator.wikimedia.org/T210693) [19:46:38] (03CR) 10Cwhite: "Changes look good: https://puppet-compiler.wmflabs.org/compiler1002/13837/" [puppet] - 10https://gerrit.wikimedia.org/r/477620 (https://phabricator.wikimedia.org/T147326) (owner: 10Cwhite) [19:50:44] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [19:51:50] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy [19:53:36] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received [19:54:40] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy [19:56:01] (03CR) 10Dzahn: [C: 032] "convenience CNAME for humans to not have to remember numbers and keep counting up from 1001" [dns] - 10https://gerrit.wikimedia.org/r/476330 (owner: 10Dzahn) [19:57:24] 10Operations, 10ops-codfw, 10Core Platform Team, 10Services (doing), and 2 others: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843 (10Eevans) [19:57:35] (03PS3) 10Dzahn: add maintenance.eqiad CNAME, point to mwmaint1002 [dns] - 10https://gerrit.wikimedia.org/r/476330 [19:58:15] (03PS4) 10Dzahn: add maintenance.eqiad CNAME, point to mwmaint1002 [dns] - 10https://gerrit.wikimedia.org/r/476330 [19:58:35] (03CR) 10Dzahn: [C: 032] add maintenance.eqiad CNAME, point to mwmaint1002 [dns] - 10https://gerrit.wikimedia.org/r/476330 (owner: 10Dzahn) [20:01:47] (03CR) 10Dzahn: [C: 04-1] "https://puppet-compiler.wmflabs.org/compiler1002/47/" [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [20:03:03] (03PS5) 10CDanis: Add role::grafana and switch grafana1001.eqiad to it [puppet] - 10https://gerrit.wikimedia.org/r/477557 
(https://phabricator.wikimedia.org/T210416) [20:03:34] (03CR) 10CDanis: [V: 032] Add role::grafana and switch grafana1001.eqiad to it [puppet] - 10https://gerrit.wikimedia.org/r/477557 (https://phabricator.wikimedia.org/T210416) (owner: 10CDanis) [20:03:54] RECOVERY - cassandra-a SSL 10.192.32.108:7001 on restbase2016 is OK: SSL OK - Certificate restbase2016-a valid until 2020-11-29 09:26:14 +0000 (expires in 725 days) [20:03:58] RECOVERY - cassandra-a service on restbase2016 is OK: OK - cassandra-a is active [20:04:33] !log bootstrapping cassandra-a, restbase2016 -- T210843 [20:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:37] T210843: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843 [20:07:06] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 616.18 seconds [20:08:03] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): tools-k8s-master-01 has two floating IPs - https://phabricator.wikimedia.org/T164123 (10GTirloni) This was discussed today and the best guess is that some tenant (possibly PAWS) wanted to expose an endpoint publicly but either ended up abandonin... [20:08:20] 10Operations, 10Toolforge, 10cloud-services-team (Kanban): tools-k8s-master-01 has two floating IPs - https://phabricator.wikimedia.org/T164123 (10GTirloni) 05Open>03Resolved [20:11:05] (03CR) 10Dzahn: [C: 04-1] "eh.. why does it fail about modules/role/manifests/phabricator.pp . i don't change the role class in here?" [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [20:13:12] 10Operations, 10Citoid, 10Regression, 10VisualEditor (Current work): Some regressions in production with Zotero translation-server in production at all - https://phabricator.wikimedia.org/T211114 (10Mvolz) The qid issue is still ongoing; I suspect that Zotero is crashing with that request. 
[20:15:32] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [20:16:40] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [20:16:42] PROBLEM - Check systemd state on grafana1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:16:48] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received [20:17:12] PROBLEM - puppet last run on grafana1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/grafana/ldap.toml] [20:17:28] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [20:18:40] ACKNOWLEDGEMENT - Check systemd state on grafana1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T210416 [20:18:40] ACKNOWLEDGEMENT - puppet last run on grafana1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/grafana/ldap.toml] daniel_zahn https://phabricator.wikimedia.org/T210416 [20:18:41] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [20:19:24] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [20:19:40] 10Operations, 10decommission, 10hardware-requests, 10Patch-For-Review, 10User-fgiunchedi: Return graphite100[13] to spares pool (or decom) - https://phabricator.wikimedia.org/T209357 (10colewhite) [20:19:42] 10Operations, 10decommission, 10hardware-requests, 10User-fgiunchedi: Return graphite200[12] to spares pool - https://phabricator.wikimedia.org/T199321 (10colewhite) [20:20:00] 10Operations, 10Citoid, 10Patch-For-Review, 10Services (done), 10VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10Pchelolo) Seems like after this has been done the citation alerts started flapping much more then they used... [20:20:43] hmm [20:20:48] looking at puppet logs on grafana1001 now [20:21:46] RECOVERY - Check systemd state on grafana1001 is OK: OK - running: The system is fully operational [20:22:12] RECOVERY - puppet last run on grafana1001 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [20:25:28] mutante: thanks for the ack [20:25:38] PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 513 bytes in 0.031 second response time [20:25:42] reran puppet by hand and everything was happy [20:26:14] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [20:26:55] gtirloni: suppose that k8s alert is pining for the floating IP? 
[20:27:14] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [20:27:46] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [20:28:44] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy [20:31:19] cdanis: welcome:) [20:31:33] it's normal when adding new hosts with new role [20:31:46] cant schedule downtime before they exist [20:31:57] andrewbogott: i'm checking [20:31:58] is "something goes wrong with puppet the first run through" also normal? [20:32:04] thanks! [20:32:25] cdanis: normally not. but depends on the role. some things only work after 2 runs [20:32:32] heh [20:33:39] is it about not finding stuff in Hiera? [20:33:45] that is also added in the same patch [20:34:29] Dec 4 20:06:22 grafana1001 puppet-agent[27200]: (/Stage[main]/Grafana/File[/etc/grafana/ldap.toml]/ensure) change from absent to file failed: Could not set 'file' on ensure: No such file or directory @ dir_s_mkdir - /etc/grafana/ldap.toml20181204-27200-9detc.lock at /etc/puppet/modules/grafana/manifests/init.pp:91 [20:35:35] (03PS1) 10Effie Mouzeli: nagios: Add halfak in team-scoring contact group [puppet] - 10https://gerrit.wikimedia.org/r/477629 (https://phabricator.wikimedia.org/T210742) [20:36:07] ACKNOWLEDGEMENT - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 513 bytes in 0.034 second response time GTirloni Floating IPs removed from k8s-master-01 [20:36:46] cdanis: i see the file is there now [20:36:56] indeed, it worked correctly when rerunning puppet [20:37:00] cdanis: so that would be a puppet issue in the grafana role then [20:37:17] but a minor one because it works from second run, ack [20:38:43] 
the file resource for ldap.toml would need to be there first [20:38:47] there is already " before => Service['grafana-server'], [20:39:12] for the file resource for ldap.toml.. but apparently that isn't enough [20:40:44] i think i know what happened [20:41:05] (03CR) 10Dzahn: [C: 031] nagios: Add halfak in team-scoring contact group [puppet] - 10https://gerrit.wikimedia.org/r/477629 (https://phabricator.wikimedia.org/T210742) (owner: 10Effie Mouzeli) [20:41:07] yes [20:41:55] okay, the problem is that /etc/grafana did not exist yet, because ldap.toml doesn't require => Package['grafana'], like /etc/grafana/grafana.ini does. [20:42:24] I will fix [20:42:38] (03PS15) 10Paladox: phabricator: add data types to all parameters [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [20:42:44] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [20:44:26] (03PS16) 10Paladox: phabricator: add data types to all parameters [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [20:44:40] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [20:46:28] cdanis: :) perfect [20:46:34] (03CR) 10Effie Mouzeli: [C: 032] nagios: Add halfak in team-scoring contact group [puppet] - 10https://gerrit.wikimedia.org/r/477629 (https://phabricator.wikimedia.org/T210742) (owner: 10Effie Mouzeli) [20:46:46] (03PS3) 10Gehel: Enable SPARQL logging for internal & production [puppet] - 10https://gerrit.wikimedia.org/r/477621 (owner: 10Smalyshev) [20:47:20] (03PS1) 10CDanis: Fix race condition in ::grafana puppet module [puppet] - 10https://gerrit.wikimedia.org/r/477632 (https://phabricator.wikimedia.org/T210416) [20:47:30] (03CR) 10Gehel: [C: 032] Enable SPARQL logging for internal & production [puppet] - 10https://gerrit.wikimedia.org/r/477621 (owner: 10Smalyshev) [20:47:56] (03PS4) 10Gehel: Enable SPARQL logging for internal & production [puppet] - 
10https://gerrit.wikimedia.org/r/477621 (owner: 10Smalyshev) [20:49:05] (03CR) 10Dzahn: [C: 031] Fix race condition in ::grafana puppet module [puppet] - 10https://gerrit.wikimedia.org/r/477632 (https://phabricator.wikimedia.org/T210416) (owner: 10CDanis) [20:59:08] (03PS17) 10Paladox: phabricator: add data types to all parameters [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [20:59:13] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [21:02:05] (03PS18) 10Paladox: phabricator: add data types to all parameters [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [21:02:12] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [21:07:00] (03CR) 10CDanis: "https://puppet-compiler.wmflabs.org/compiler1002/13840/" [puppet] - 10https://gerrit.wikimedia.org/r/477632 (https://phabricator.wikimedia.org/T210416) (owner: 10CDanis) [21:07:08] (03PS2) 10CDanis: Fix race condition in ::grafana puppet module [puppet] - 10https://gerrit.wikimedia.org/r/477632 (https://phabricator.wikimedia.org/T210416) [21:09:46] (03CR) 10CDanis: [C: 032] Fix race condition in ::grafana puppet module [puppet] - 10https://gerrit.wikimedia.org/r/477632 (https://phabricator.wikimedia.org/T210416) (owner: 10CDanis) [21:11:05] 10Operations, 10ops-codfw, 10netops: codfw row A recable and add QFX - https://phabricator.wikimedia.org/T210447 (10ayounsi) Parts keeps getting delayed, new shipping is expected for this Friday, rescheduling the work for next Wednesday. 
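The grafana1001 failure diagnosed above comes down to a missing dependency edge: the `ldap.toml` file resource had `before => Service['grafana-server']` but no `require => Package['grafana']`, so nothing guaranteed that `/etc/grafana` existed when the file was written. The sketch below models that in Python rather than Puppet. It only checks declared edges and simulates the directory; it is not how Puppet schedules resources, and the edge from the service to the package is an assumption made for illustration.

```python
from graphlib import TopologicalSorter

PACKAGE = "Package[grafana]"
LDAP = "File[/etc/grafana/ldap.toml]"
SERVICE = "Service[grafana-server]"


def is_valid_order(order, deps):
    """True if every resource comes after all of its declared dependencies."""
    pos = {name: i for i, name in enumerate(order)}
    return all(pos[d] < pos[r] for r, ds in deps.items() for d in ds)


def simulate_run(order):
    """Apply resources in order and return those that fail.

    The package creates /etc/grafana; the file resource fails if that
    directory does not exist yet.
    """
    dirs = set()
    failed = []
    for res in order:
        if res == PACKAGE:
            dirs.add("/etc/grafana")
        elif res == LDAP and "/etc/grafana" not in dirs:
            failed.append(res)
    return failed


# Each mapping is resource -> resources that must be applied first.
# `before => Service[...]` on the file is the same edge as the service
# depending on the file. Service -> Package is assumed, not from the log.
without_require = {PACKAGE: set(), LDAP: set(), SERVICE: {PACKAGE, LDAP}}
with_require = {PACKAGE: set(), LDAP: {PACKAGE}, SERVICE: {PACKAGE, LDAP}}

# Permitted by the original edges, yet it writes the file before the package.
bad_order = [LDAP, PACKAGE, SERVICE]

# With the extra edge, the only order left puts the package first.
fixed_order = list(TopologicalSorter(with_require).static_order())
```

Here `bad_order` passes `is_valid_order` against `without_require` but fails in `simulate_run`, which matches the first-run error and the clean second run seen on grafana1001. Once the `require` edge is declared, that order is no longer allowed.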
[21:11:30] 10Operations, 10ops-codfw, 10netops: codfw row A recable and add QFX - https://phabricator.wikimedia.org/T210447 (10ayounsi) [21:14:53] (03PS19) 10Paladox: phabricator: add data types to all parameters [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [21:14:59] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [21:15:02] 10Operations, 10Icinga, 10Scoring-platform-team, 10Patch-For-Review: Add ahalfaker to ORES-related icinga contacts - https://phabricator.wikimedia.org/T210742 (10jijiki) 05Open>03Resolved a:03jijiki @Halfak Let us know it everything works as it should :) [21:23:48] (03PS20) 10Paladox: phabricator: add data types to all parameters [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [21:23:48] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 292.08 seconds [21:23:53] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [21:30:35] (03PS21) 10Paladox: phabricator: add data types to all parameters [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [21:30:39] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [21:31:40] (03PS1) 10Papaul: DHCP: Add MAC address entries for elastic2045 - elastic2054 [puppet] - 10https://gerrit.wikimedia.org/r/477675 (https://phabricator.wikimedia.org/T210450) [21:33:22] (03CR) 10Dzahn: [V: 031] DHCP: Add MAC address entries for elastic2045 - elastic2054 [puppet] - 10https://gerrit.wikimedia.org/r/477675 (https://phabricator.wikimedia.org/T210450) (owner: 10Papaul) [21:33:26] (03CR) 10Dzahn: [V: 031 C: 032] DHCP: Add MAC address entries for elastic2045 - elastic2054 [puppet] - 10https://gerrit.wikimedia.org/r/477675 (https://phabricator.wikimedia.org/T210450) (owner: 10Papaul) [21:33:35] (03PS2) 10Dzahn: DHCP: Add MAC address entries for elastic2045 - 
elastic2054 [puppet] - 10https://gerrit.wikimedia.org/r/477675 (https://phabricator.wikimedia.org/T210450) (owner: 10Papaul) [21:34:12] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (10Papaul) [21:37:17] (03PS1) 10Dzahn: DHCP: fix one extra space in MAC of elastic2054 [puppet] - 10https://gerrit.wikimedia.org/r/477676 (https://phabricator.wikimedia.org/T210450) [21:37:36] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Grant fdans permissions to deploy AQS in prod, and accessing the aqs hosts - https://phabricator.wikimedia.org/T211095 (10jijiki) p:05Triage>03Normal [21:38:22] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Grant fdans permissions to deploy AQS in prod, and accessing the aqs hosts - https://phabricator.wikimedia.org/T211095 (10jijiki) Pending approval from SRE meeting [21:38:57] 10Operations, 10Citoid, 10Regression, 10VisualEditor (Current work): QIDs work locally but not in production with new translation-server - https://phabricator.wikimedia.org/T211148 (10Mvolz) p:05Triage>03Normal [21:39:23] 10Operations, 10SRE-Access-Requests, 10User-jijiki: Requesting access to `researchers` group for joewalsh - https://phabricator.wikimedia.org/T211115 (10jijiki) p:05Triage>03Normal [21:40:38] (03PS3) 10BPirkle: Create script to intentionally trigger fatal errors in MediaWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477450 (https://phabricator.wikimedia.org/T210567) [21:40:50] (03CR) 10Dzahn: [C: 032] DHCP: fix one extra space in MAC of elastic2054 [puppet] - 10https://gerrit.wikimedia.org/r/477676 (https://phabricator.wikimedia.org/T210450) (owner: 10Dzahn) [21:41:32] !log make cr3/4-ulsfo conform to jnt [21:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:43] (03CR) 10jerkins-bot: [V: 04-1] Create script to intentionally trigger fatal errors in 
MediaWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477450 (https://phabricator.wikimedia.org/T210567) (owner: 10BPirkle) [21:44:42] !log make cr2-eqord/eqdfw conform to jnt [21:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:38] PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:46:40] 10Operations: Usual git mechanism for aborting commit does not work on the private puppet repo - https://phabricator.wikimedia.org/T211121 (10jijiki) p:05Triage>03Normal [21:47:51] 10Operations, 10Gerrit: Convert Gerrit to use H2 as the database after 2.16 upgrade - https://phabricator.wikimedia.org/T211139 (10jijiki) [21:49:15] !log make cr1/2-codfw conform to jnt [21:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:24] (03PS4) 10BPirkle: Create script to intentionally trigger fatal errors in MediaWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477450 (https://phabricator.wikimedia.org/T210567) [21:54:42] 10Operations, 10ORES, 10Scoring-platform-team: Build helm charts for ORES - https://phabricator.wikimedia.org/T210269 (10jijiki) p:05Triage>03Normal [21:57:16] !log clear ethernet-swtiching table for labvirt1009:eth1's switch port [21:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:43] !log clear ethernet-swtiching table for labvirt1004:eth1's switch port [21:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:20] 10Operations, 10Analytics, 10Performance-Team, 10Traffic: Only serve debug HTTP headers when x-wikimedia-debug is present - https://phabricator.wikimedia.org/T210484 (10jijiki) p:05Triage>03Normal [22:05:37] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (10Papaul) [22:06:24] (03CR) 10BPirkle: 
"Refactored to use a class, thereby putting much less into the global namespace." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477450 (https://phabricator.wikimedia.org/T210567) (owner: 10BPirkle) [22:08:49] 10Operations, 10Deployments: Make failures on foreachwiki more obvious the deployer - https://phabricator.wikimedia.org/T210474 (10jijiki) p:05Triage>03Normal [22:09:12] (03CR) 10Dzahn: [C: 031] "https://puppet-compiler.wmflabs.org/compiler1002/13841/" [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [22:11:47] (03PS3) 10Gehel: Disable SPARQL logging [puppet] - 10https://gerrit.wikimedia.org/r/477622 (owner: 10Smalyshev) [22:12:52] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [uimage] [22:13:13] (03CR) 10Gehel: [C: 032] Disable SPARQL logging [puppet] - 10https://gerrit.wikimedia.org/r/477622 (owner: 10Smalyshev) [22:14:17] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install codfw logstash elasticsearch storage servers - https://phabricator.wikimedia.org/T211065 (10Papaul) [22:14:52] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [uimage] [22:15:18] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [uimage] [22:15:28] 10Operations, 10Mail, 10WMF-Legal: Tracking down gary@ 
and redirecting it to trustandsafety@ - https://phabricator.wikimedia.org/T210464 (10jijiki) p:05Triage>03Normal @bcampbell it would be great if @RStallman-legalteam or someone from legal could let us now how to handle this properly. [22:16:02] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy [22:16:46] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [22:17:38] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [22:18:09] 10Operations, 10Core Platform Team Backlog (Next), 10Patch-For-Review, 10Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10jijiki) p:05Triage>03Normal [22:22:27] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (10Papaul) [22:23:24] 10Operations, 10CirrusSearch, 10Discovery-Search: Find an alternative to curl connection pooling available in HHVM - https://phabricator.wikimedia.org/T210717 (10jijiki) p:05Triage>03Normal [22:24:39] 10Operations, 10monitoring, 10User-fgiunchedi: Address recurrent service check time out for "HP RAID" on swift backend hosts - https://phabricator.wikimedia.org/T210723 (10jijiki) p:05Triage>03Normal [22:27:08] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for Christoph Jauera (WMDE-Fisch) - https://phabricator.wikimedia.org/T211014 (10jijiki) p:05Triage>03Normal Pending SRE meeting approval. 
[22:28:40] (03PS22) 10Dzahn: phabricator: add data types to all parameters [puppet] - 10https://gerrit.wikimedia.org/r/471325 [22:30:14] (03CR) 10Dzahn: [C: 032] phabricator: add data types to all parameters [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [22:34:46] (03CR) 10Dzahn: [C: 032] "complete noop in production, phab1001, phab1002, phab2001, only change in compiler were strings to actual integers for cron times and port" [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [22:35:24] RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.201 second response time [22:46:57] !log remove neodymium/sarin from term labs-in4 on cr1/2-eqiad - T210612 [22:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:00] T210612: Remove neodymium/sarin from router ACLs - https://phabricator.wikimedia.org/T210612 [22:52:18] 10Operations, 10Discovery-Search (Current work): Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10jijiki) p:05Triage>03Normal [22:56:02] 10Operations, 10Toolforge, 10cloud-services-team (Kanban): tools-k8s-master-01 has two floating IPs - https://phabricator.wikimedia.org/T164123 (10GTirloni) When I was investigating what was using those floating IPs, I focused on network captures. Unfortunately, I missed the fact that `k8s-master.tools.wmfla... 
[23:08:41] (03PS1) 10Bstorm: sonofgridengine: explicit dependency for shadow_masters file [puppet] - 10https://gerrit.wikimedia.org/r/477700 (https://phabricator.wikimedia.org/T200557) [23:10:12] (03CR) 10Bstorm: [C: 032] sonofgridengine: explicit dependency for shadow_masters file [puppet] - 10https://gerrit.wikimedia.org/r/477700 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [23:19:24] RECOVERY - cassandra-a CQL 10.192.32.108:9042 on restbase2016 is OK: TCP OK - 0.036 second response time on 10.192.32.108 port 9042 [23:33:23] !log update prefix-list peering4 on cr1-eqsin to match jnt [23:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log