[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181129T0000). [00:00:04] dmaza: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:55] I'm here [00:01:00] ACKNOWLEDGEMENT - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T209395 [00:02:48] (03PS1) 10Catrope: Add throttle exception for Wikimedia event on December 6th [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476432 (https://phabricator.wikimedia.org/T210681) [00:03:16] I can do the SWAT [00:03:23] That also lets me add in my own patch :) [00:03:46] ;) [00:04:24] dmaza: Do you also have a config patch to enable $wgEnableBlockNoticeStats, or is it just a dark deploy for now? 
[00:04:39] (03CR) 10Catrope: [C: 032] Add throttle exception for Wikimedia event on December 6th [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476432 (https://phabricator.wikimedia.org/T210681) (owner: 10Catrope) [00:04:43] we are not enabling yet [00:05:52] 10Operations, 10Traffic: lvs1006 down - https://phabricator.wikimedia.org/T210683 (10Dzahn) [00:05:58] RoanKattouw: cheater :P [00:06:01] 10Operations, 10Traffic: lvs1006 down - https://phabricator.wikimedia.org/T210683 (10Dzahn) p:05Triage>03Normal [00:06:06] (03Merged) 10jenkins-bot: Add throttle exception for Wikimedia event on December 6th [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476432 (https://phabricator.wikimedia.org/T210681) (owner: 10Catrope) [00:06:15] ACKNOWLEDGEMENT - Host lvs1006 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T210683 [00:06:59] RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms [00:07:21] (03CR) 10Faidon Liambotis: [C: 032] Misc pylint fixes [software/keyholder] - 10https://gerrit.wikimedia.org/r/476429 (owner: 10Faidon Liambotis) [00:07:53] (03Merged) 10jenkins-bot: Misc pylint fixes [software/keyholder] - 10https://gerrit.wikimedia.org/r/476429 (owner: 10Faidon Liambotis) [00:08:51] !log catrope@deploy1001 Synchronized wmf-config/throttle.php: T210681 (duration: 01m 04s) [00:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:08:58] T210681: Throttle exemption for event at Wikimedia office on 2018-12-06 - https://phabricator.wikimedia.org/T210681 [00:11:53] PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100% [00:12:17] (03CR) 10jenkins-bot: Add throttle exception for Wikimedia event on December 6th [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476432 (https://phabricator.wikimedia.org/T210681) (owner: 10Catrope) [00:12:54] (03CR) 10Dzahn: [C: 04-1] "yea, simpler now but still conflicting with apache module for now 
https://puppet-compiler.wmflabs.org/compiler1002/13772/mwmaint1002.eqia" [puppet] - 10https://gerrit.wikimedia.org/r/416751 (owner: 10Dzahn) [00:14:25] (03PS11) 10Dzahn: analytics_cluster::webserver: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/416742 [00:17:27] RoanKattouw: let me know when it's done [00:17:41] RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms [00:18:12] dmaza: It failed Jenkins, trying again now [00:18:55] 👍 [00:23:47] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/13773/thorium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/416742 (owner: 10Dzahn) [00:47:45] RoanKattouw: are you still around? [00:47:52] Yes, it just merged [00:48:00] Pulling it onto mwdebug1002 now [00:48:49] dmaza: OK, ready for testing on mwdebug now [00:48:55] Insofar as there is anything to test [00:49:09] not really.. just wanna make sure nothing is on fire [00:49:17] one sec [00:52:42] RoanKattouw: everything looks good [00:53:40] OK, syncing [00:55:38] !log catrope@deploy1001 Synchronized php-1.33.0-wmf.6/includes/: Add block notice stats on EditPage (T201718) (duration: 01m 14s) [00:55:59] thank you [00:57:42] catrope@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [00:57:43] T201718: Tracking blocks: Log when the desktop VisualEditor + 2010 wikitext editor block notice is displayed - https://phabricator.wikimedia.org/T201718 [01:00:04] twentyafterfour: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181129T0100). 
[01:04:50] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, 10cloud-services-team (Kanban): Decommission labstore100[12] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10bd808) [01:12:50] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, 10cloud-services-team (Kanban): Decommission labstore100[12] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10bd808) [01:13:04] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, 10cloud-services-team (Kanban): Decommission labstore100[12] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10bd808) [01:18:44] 10Operations, 10Analytics, 10Security-Team, 10WMF-Legal, 10Software-Licensing: Can exfat be used in WMF production? - https://phabricator.wikimedia.org/T210667 (10chasemp) Small bit of background from my perspective, I had discussed this on hangout with a few folks who I will let acknowledge their own le... [01:24:08] (03CR) 10Dzahn: [C: 031] "based on old comments on PS7 "if pcc is ok feel free to merge :)" planning to go ahead with this and merge tomorrow" [puppet] - 10https://gerrit.wikimedia.org/r/416742 (owner: 10Dzahn) [02:14:37] (03CR) 10BBlack: [C: 031] cache/trafficserver: replace rutherfordium with people1001, backend and director [puppet] - 10https://gerrit.wikimedia.org/r/475236 (https://phabricator.wikimedia.org/T210036) (owner: 10Dzahn) [02:35:02] 10Operations, 10Traffic: lvs1006 down - https://phabricator.wikimedia.org/T210683 (10BBlack) Yeah I got busy and dropped this. Console was unresponsive initially. Reboot produced a responsive console, but wasn't able to initially ssh into the host (and no icinga recovery). With the fresh reboot, eth0 has no... 
[02:35:55] 10Operations, 10ops-eqiad, 10Traffic, 10netops: lvs1006 down - https://phabricator.wikimedia.org/T210683 (10BBlack) [02:54:25] 10Operations, 10ops-eqiad, 10Traffic, 10netops: lvs1006 down - https://phabricator.wikimedia.org/T210683 (10ayounsi) a:03Cmjohnson Port looks down (but not disabled) on the switch side, I'd say next step is for Chris to try re-seating then different cable/ports/etc. [03:33:53] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 884.97 seconds [04:00:24] (03CR) 10Andrew Bogott: [C: 031] openstack: Move Keystone DB credentials to my.cnf file [puppet] - 10https://gerrit.wikimedia.org/r/476109 (https://phabricator.wikimedia.org/T210404) (owner: 10GTirloni) [04:03:00] (03CR) 10Andrew Bogott: "I don't know what phragile is (I'm probably a member because I created the project) but overall this looks OK to me. I expect that it wil" [puppet] - 10https://gerrit.wikimedia.org/r/475032 (owner: 10Dzahn) [04:14:29] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 192.74 seconds [04:26:27] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 79, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:31:17] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 91.198.174.245, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:59:09] RECOVERY - ensure kvm processes are running on labvirt1011 is OK: PROCS OK: 1 process with regex args /usr/bin/kvm [05:01:31] PROBLEM - ensure kvm processes are running on labvirt1011 is CRITICAL: PROCS CRITICAL: 0 processes with regex args /usr/bin/kvm [05:03:07] those labvirt1011 pages are weird but not important [05:52:25] 10Operations, 10Traffic: INMARSAT geolocates to the UK, leading to requests going to esams 
- https://phabricator.wikimedia.org/T209785 (10Reedy) Same IP going back, 161.30.203.16 ` Reedys-MacBook-Pro:~ reedy$ dig +short reflect.wikimedia.org 161.30.203.0 ` But it does seem to be going to ulsfo now, I guess a... [06:11:30] !log Deploy schema change on s6 primary master - T86338 [06:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:34] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [06:13:02] (03PS1) 10Marostegui: db-eqiad.php: Depool db1088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476451 (https://phabricator.wikimedia.org/T202167) [06:15:01] !log reedy@deploy1001 Synchronized php-1.33.0-wmf.6/extensions/OATHAuth: revert logging (loldeployingfromaplane) (duration: 00m 59s) [06:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:11] PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100% [06:20:23] RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms [06:27:00] (03PS1) 10Marostegui: Revert "dump_section.py: Increase retention from 18 days to 45" [puppet] - 10https://gerrit.wikimedia.org/r/476452 [06:27:15] (03PS2) 10Marostegui: Revert "dump_section.py: Increase retention from 18 days to 45" [puppet] - 10https://gerrit.wikimedia.org/r/476452 [06:28:07] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476451 (https://phabricator.wikimedia.org/T202167) (owner: 10Marostegui) [06:28:20] (03CR) 10Marostegui: [C: 032] Revert "dump_section.py: Increase retention from 18 days to 45" [puppet] - 10https://gerrit.wikimedia.org/r/476452 (owner: 10Marostegui) [06:29:10] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476451 (https://phabricator.wikimedia.org/T202167) (owner: 10Marostegui) [06:29:15] PROBLEM - puppet last run on ms-be1035 is CRITICAL: CRITICAL: Puppet has 1 
failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled] [06:29:37] PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/DigiCert_High_Assurance_CA-3.crt] [06:29:41] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476451 (https://phabricator.wikimedia.org/T202167) (owner: 10Marostegui) [06:31:07] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/R/update-library.R] [06:31:09] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1088 T202167 (duration: 00m 56s) [06:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:31:13] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 [06:31:19] !log Deploy schema change on db1088 - T202167 [06:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:21] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476453 [06:33:19] PROBLEM - puppet last run on labmon1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/modprobe.d/nf_conntrack.conf] [06:33:28] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476453 (owner: 10Marostegui) [06:34:30] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476453 (owner: 10Marostegui) [06:35:35] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1088 T202167 (duration: 00m 53s) [06:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:36:00] !log Deploy schema change on db1061 (s6 master) - T202167 [06:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:43] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476453 (owner: 10Marostegui) [06:46:57] PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100% [06:48:25] ^ that host rebooted itself (I am checking the idrac) [06:48:31] It is booting up now [06:49:25] RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms [06:53:05] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [06:55:10] (03PS1) 10Marostegui: db-eqiad.php: Depool pc1006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476454 (https://phabricator.wikimedia.org/T208383) [06:56:59] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:57:12] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool pc1006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476454 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [06:58:14] (03Merged) 10jenkins-bot: db-eqiad.php: Depool pc1006 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/476454 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [06:59:15] RECOVERY - puppet last run on labmon1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [06:59:29] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool pc1006 - T208383 (duration: 00m 53s) [06:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:33] T208383: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 [06:59:42] !log Stop MySQL on pc1006 to clone pc1009 - T208383 [06:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:20] RECOVERY - puppet last run on ms-be1035 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:00:41] RECOVERY - puppet last run on mw1307 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:05:51] 10Operations, 10Analytics, 10Security-Team, 10WMF-Legal, 10Software-Licensing: Can exfat be used in WMF production? - https://phabricator.wikimedia.org/T210667 (10Joe) >>! In T210667#4783582, @Legoktm wrote: >>>! In T210667#4783289, @MoritzMuehlenhoff wrote: >> exfat-fuse itself is free software (GPL) an... 
[07:08:07] PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100% [07:09:22] (03CR) 10jenkins-bot: db-eqiad.php: Depool pc1006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476454 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [07:10:31] RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms [07:11:45] 10Operations: ms-be2047 rebooting itself - https://phabricator.wikimedia.org/T210697 (10Marostegui) [07:12:03] 10Operations, 10ops-codfw: ms-be2047 rebooting itself - https://phabricator.wikimedia.org/T210697 (10Marostegui) p:05Triage>03Normal [07:22:55] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [07:26:34] (03PS3) 10Vgutierrez: gerrit: Use the certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/476301 (https://phabricator.wikimedia.org/T207050) [07:29:54] (03CR) 10Vgutierrez: "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1002/13774/" [puppet] - 10https://gerrit.wikimedia.org/r/476301 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [07:52:36] (03CR) 10Giuseppe Lavagetto: [C: 031] "See my comment; this would be a -1 but Valentin guaranteed it's a temporary hack, so LGTM once you add a todo note there." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/476301 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [07:57:00] mhhh ores is in trouble [07:57:36] i.e. 
elevated 500s [07:57:52] (03PS4) 10Vgutierrez: gerrit: Use the certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/476301 (https://phabricator.wikimedia.org/T207050) [08:01:45] (03CR) 10Vgutierrez: [C: 032] gerrit: Use the certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/476301 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [08:02:00] (03PS5) 10Vgutierrez: gerrit: Use the certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/476301 (https://phabricator.wikimedia.org/T207050) [08:04:22] !log replacing TLS certificates in gerrit - T207050 [08:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:26] T207050: Migrate most standard public TLS certificates to CertCentral issuance - https://phabricator.wikimedia.org/T207050 [08:05:25] PROBLEM - PHP7 rendering on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 724 bytes in 0.001 second response time [08:05:49] willikins:~ vgutierrez$ echo | openssl s_client -servername gerrit.wikimedia.org -connect gerrit.wikimedia.org:443 2>/dev/null | openssl x509 -noout -dates [08:05:49] notBefore=Nov 28 14:45:53 2018 GMT [08:08:35] I'm opening a task for ores 500s, there's exceptions in logs but I'm not sure where to start debugging [08:09:39] (03PS2) 10Muehlenhoff: Remove Diamond from redis::misc systems [puppet] - 10https://gerrit.wikimedia.org/r/476226 (https://phabricator.wikimedia.org/T183454) [08:13:54] 10Operations, 10ORES, 10Scoring-platform-team: ORES 500s since 2018-11-29 6:25 - https://phabricator.wikimedia.org/T210701 (10fgiunchedi) [08:14:14] (03CR) 10Muehlenhoff: [C: 032] Remove Diamond from redis::misc systems [puppet] - 10https://gerrit.wikimedia.org/r/476226 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [08:14:52] Amir1: around? 
T210701 [08:14:53] T210701: ORES 500s since 2018-11-29 6:25 - https://phabricator.wikimedia.org/T210701 [08:16:39] PROBLEM - Apache HTTP on mw1261 is CRITICAL: connect to address 10.64.0.56 and port 80: Connection refused [08:17:05] PROBLEM - Nginx local proxy to apache on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.008 second response time [08:17:07] PROBLEM - HHVM rendering on mw1261 is CRITICAL: connect to address 10.64.0.56 and port 80: Connection refused [08:17:07] PROBLEM - Check systemd state on mw1261 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:17:11] <_joe_> mw1261 is me [08:17:15] <_joe_> sorry for the noise [08:17:22] <_joe_> it's depooled, it will take some time to fix [08:37:11] RECOVERY - Nginx local proxy to apache on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.058 second response time [08:37:13] RECOVERY - HHVM rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 75945 bytes in 0.154 second response time [08:37:17] RECOVERY - Check systemd state on mw1261 is OK: OK - running: The system is fully operational [08:37:57] RECOVERY - Apache HTTP on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.070 second response time [08:39:28] godog: I just woke up [08:39:42] Amir1: good morning! [08:39:47] RECOVERY - PHP7 rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 75985 bytes in 0.204 second response time [08:39:52] Someone deployed a change. Where are getting it [08:39:59] Is it prod? [08:40:20] it is yeah [08:40:39] Amir1: https://phabricator.wikimedia.org/T210701 [08:40:51] Shoot [08:41:00] it's not just itemequality btw, nothing specific to it [08:41:07] I also see goodfaith and so on [08:41:14] This is on master but we didn't deploy it [08:41:17] <_joe_> sorry, I have a question [08:41:28] <_joe_> why don't we get any alert for this? 
[08:41:45] <_joe_> well this can be answered later [08:42:12] _joe_: it seems this happens for some cases and not all [08:42:35] <_joe_> Amir1: looking at grafana, it seems nothing works, but well [08:42:36] _joe_: there is one (albeit it's just one and not really helping). [08:42:49] https://grafana.wikimedia.org/dashboard/db/ores grafana alert | CRITICAL | 2018-11-29 08:41:52 | 0d 2h 6m 35s | 3/3 | CRITICAL: https://grafana.wikimedia.org/dashboard/db/ores is alerting: 5xx rate (Change prop) alert. [08:42:56] I missed it too btw [08:43:00] never really saw it [08:43:04] <_joe_> so did anyone try to restart ores workers? [08:43:04] there's also the availability alert, which is how I noticed [08:43:13] <_joe_> it's clear something went very wrong at logrotate time [08:43:36] 500 is not in this https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&panelId=47&fullscreen&orgId=1 [08:43:45] I should add it, and then make a graph [08:43:55] <_joe_> akosiaris: that alert should page imho [08:43:56] anyway, let's get back to the main issue at hand [08:44:21] yeah let's write that we need better alerting in the incident report [08:44:29] but let's actually figure out the problem now [08:44:35] Active: active (running) since Wed 2018-11-28 10:13:46 UTC; 22h ago [08:44:42] that's the celery worker on scb1001 [08:44:44] em [08:44:45] <_joe_> ok, can I try to restart one of the workers? 
[08:44:46] ores1001 [08:44:51] <_joe_> akosiaris: same on ores1003 [08:44:54] <_joe_> 22 h ago [08:45:21] lemme gather some stats whether all workers are in the same state [08:45:59] yup, all saying 22 hours ago [08:46:06] <_joe_> akosiaris: so probably some release [08:46:13] I was expecting some issue with logrotate tbh [08:46:21] the 6:25 time is awfully weird [08:46:28] 10Operations, 10Services: Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10MoritzMuehlenhoff) [08:46:28] <_joe_> I'll restart uwsgi on ores1003 [08:46:41] <_joe_> akosiaris: yeah let's see what logrotate does [08:46:53] uwsgi-ores is also since 22h ago [08:46:55] <_joe_> -rw-r--r-- 1 www-data www-data 6004 Nov 29 06:25 app.log.1 [08:46:56] across the fleet [08:47:04] <_joe_> so we rotate at 6:25 [08:47:26] <_joe_> postrotate [08:47:28] <_joe_> service uwsgi-ores reload [08:47:30] <_joe_> endscript [08:47:37] <_joe_> so yeah we did a reload at 6:25 [08:47:41] it was already complaining about something else though [08:47:43] requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='www.wikidata.org', port=443): Read timed out. 
(read timeout=5.0) [08:47:44] <_joe_> that's what caused the issue [08:47:51] that's before the rotate [08:48:10] <_joe_> but the rotate is when all went down [08:48:18] akosiaris: that happens from time to time [08:48:32] https://logstash.wikimedia.org/goto/543257bf91de9e695e5344a7dc382850 [08:48:42] <_joe_> akosiaris: let's restart one worker; if it recovers, we can depool a few for debugging and restore the service [08:48:45] well, not all, there are still some scores returned [08:48:58] (03PS12) 10Elukey: analytics_cluster::webserver: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/416742 (owner: 10Dzahn) [08:48:59] <_joe_> !log restarting uwsgi-ores on ores1003 [08:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:03] but those are probably from the cache [08:49:19] <_joe_> akosiaris: I was about to say [08:49:38] I see it did not fix anything [08:49:41] I can make the fix right now [08:49:41] <_joe_> ok interstingly [08:49:44] <_joe_> this solved nothing [08:50:04] Amir1: what fix ? have you already identified the issue ? [08:50:06] <_joe_> did we change something in celery? 
[08:50:13] <_joe_> Amir1: please do [08:50:15] akosiaris: revert my puppet patch from yesterday [08:50:20] ok doing so [08:50:36] <_joe_> also, why do we reload ores if we do copytruncate anyways [08:51:00] akosiaris: https://gerrit.wikimedia.org/r/c/operations/puppet/+/476250 [08:51:24] (03PS1) 10Alexandros Kosiaris: Revert "ores: Remove added celery configs" [puppet] - 10https://gerrit.wikimedia.org/r/476458 [08:51:28] <_joe_> ohhh I see [08:51:29] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Revert "ores: Remove added celery configs" [puppet] - 10https://gerrit.wikimedia.org/r/476458 (owner: 10Alexandros Kosiaris) [08:51:38] (03PS2) 10Alexandros Kosiaris: Revert "ores: Remove added celery configs" [puppet] - 10https://gerrit.wikimedia.org/r/476458 [08:51:40] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Revert "ores: Remove added celery configs" [puppet] - 10https://gerrit.wikimedia.org/r/476458 (owner: 10Alexandros Kosiaris) [08:51:58] <_joe_> akosiaris: need me to run puppet across all ores servers? 
[08:52:04] niah I got it [08:52:05] The reason is that this puppet change didn't trigger a celery restart, so the issue went unnoticed until 6 in the morning [08:52:32] well celery wasn't restarted at 6:25 in the morning either [08:52:44] the underlying problem is that the config exists in the code and got deployed, but it's under a different name, "local_celery" [08:52:48] it was uwsgi that received the reload [08:52:56] <_joe_> Amir1: I don't think the issue is in celery tbh [08:53:05] yeah, both of them would be affected [08:53:18] * akosiaris running puppet [08:53:21] because this is about both sending and receiving it [08:53:28] <_joe_> yes [08:53:48] ah yes celery tightly couples the consumer and the producer [08:53:52] <_joe_> that's what I was about to say, it's how uwsgi sends data if you pickle [08:54:09] https://github.com/wikimedia/ores/blob/master/config/00-main.yaml [08:54:22] restarting uwsgi and celery in eqiad [08:54:24] task_serializer: 'pickle' [08:54:27] 10Operations, 10Core Platform Team Backlog (Next), 10Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10mobrovac) [08:54:37] this hasn't been applied because it's under "local_celery" [08:54:49] (03CR) 10Gehel: "> Patch Set 4: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/475093 (https://phabricator.wikimedia.org/T206639) (owner: 10Mathew.onipe) [08:54:55] <_joe_> Amir1: I see a ton of [08:55:05] <_joe_> WARNING revscoring.scoring.environment: Differences between the current environment and the environment in which the model was constructed environment were detected [08:55:10] (03PS1) 10Vgutierrez: gerrit: Switch between old LE puppetization and certcentral using hiera [puppet] - 10https://gerrit.wikimedia.org/r/476459 (https://phabricator.wikimedia.org/T207050) [08:55:27] <_joe_> is that really an info we should keep at WARNING for every score? 
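Annotation on the exchange above: the failure mode being described (a serializer setting that sits under a section name the loader does not read, so one side of the queue silently falls back to a default) can be sketched in plain Python. This is only an illustrative sketch, not the ORES or Celery code. The CONFIG layout, the "json" default, and the helper names effective_serializer, encode, decode and roundtrip_ok are all hypothetical, and which process ended up with which serializer during the incident is an assumption.

```python
import json
import pickle

# Hypothetical config layout: the serializer setting lives under
# "local_celery", a section the lookup below does not read by default.
CONFIG = {
    "celery": {"broker": "redis://localhost"},
    "local_celery": {"task_serializer": "pickle"},
}

# Assumed fallback when no serializer is configured in the section read.
DEFAULT_SERIALIZER = "json"


def effective_serializer(config, section="celery"):
    """Return the serializer a process would use if it only reads `section`."""
    return config.get(section, {}).get("task_serializer", DEFAULT_SERIALIZER)


def encode(payload, serializer):
    """Serialize a task payload to bytes with the named serializer."""
    if serializer == "pickle":
        return pickle.dumps(payload)
    return json.dumps(payload).encode("utf-8")


def decode(blob, serializer):
    """Deserialize bytes with the named serializer."""
    if serializer == "pickle":
        return pickle.loads(blob)
    return json.loads(blob.decode("utf-8"))


def roundtrip_ok(payload, producer_serializer, consumer_serializer):
    """True if the consumer can read back what the producer sent."""
    try:
        blob = encode(payload, producer_serializer)
        return decode(blob, consumer_serializer) == payload
    except Exception:
        # Any decoding error means the two sides cannot talk to each other.
        return False
```

In this sketch a long-running process that still holds the old setting and a freshly reloaded process that reads only the "celery" section would disagree, so roundtrip_ok returns False until both sides are restarted onto the same setting. That matches the observation in the log that both sender and receiver are affected.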
[08:55:35] 2018-11-29 08:55:25,073 WARNING ores.scoring_systems.celery_queue: Queue size is too full 425 [08:55:38] ok that's good [08:55:49] it means the 2 components are communicating again [08:55:50] <_joe_> akosiaris: yeah things are back now [08:56:00] (03CR) 10jerkins-bot: [V: 04-1] gerrit: Switch between old LE puppetization and certcentral using hiera [puppet] - 10https://gerrit.wikimedia.org/r/476459 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [08:56:01] _joe_: it's not on every score, it's on every restart [08:56:08] I want to fix it too though [08:56:44] to be pedantic, a start of a celery worker [08:56:47] <_joe_> akosiaris: next time maybe restart them in a rolling fashion, I saw pybal cry :P [08:57:14] so ores was down for three hours /o\ [08:57:18] <_joe_> it's ok in this case, though, as the service was effectively down [08:57:22] _joe_: yeah I usually do that, but now we were in a state of emergency [08:57:29] <_joe_> Amir1: 2 hours 30, but yes [08:57:42] let me get out of bed, get to the office and will write a detailed incident report [08:57:54] <_joe_> Amir1: take your time :) [08:58:02] best way to start your day, I don't need coffee anymore [08:58:07] lol [08:58:16] <_joe_> Amir1: inorite [08:58:28] <_joe_> Amir1: it's even better when you get paged at 3 am though [08:58:37] <_joe_> you should try it sometimes! [08:58:39] <_joe_> :P [08:58:50] :(((( [08:58:59] (03PS2) 10Vgutierrez: gerrit: Switch between old LE puppetization and certcentral using hiera [puppet] - 10https://gerrit.wikimedia.org/r/476459 (https://phabricator.wikimedia.org/T207050) [08:59:54] 10Operations, 10ops-codfw: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (10Gehel) >>! In T210450#4781852, @Papaul wrote: > @Gehel In this case the racking proposal will not work since those racks are 1G rack. I will update the task descriptio... 
[09:01:10] 10Operations, 10ops-codfw: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (10Gehel) The new racking proposal looks good to me (new servers are still in the same row as the previous proposal, which is all I care about). [09:01:43] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:01:51] 10Operations, 10ops-codfw: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (10Gehel) [09:02:48] (03CR) 10Vgutierrez: "pcc shows the expected changes in the production environment: https://puppet-compiler.wmflabs.org/compiler1002/13776/" [puppet] - 10https://gerrit.wikimedia.org/r/476459 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [09:04:18] sigh, now we've reached redis.exceptions.ConnectionError: max number of clients reached [09:04:29] with 2.5 hours of scores backlogged ... [09:09:47] 10Operations, 10Traffic, 10Patch-For-Review: Migrate most standard public TLS certificates to CertCentral issuance - https://phabricator.wikimedia.org/T207050 (10Vgutierrez) [09:09:55] (03CR) 10Giuseppe Lavagetto: [C: 031] "Well done!" [puppet] - 10https://gerrit.wikimedia.org/r/476459 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [09:11:30] ok fixed. 
increase nofile and maxclients for redis [09:11:55] !log increase nofile of process to 20k and maxclients to 15k to account for the backlog of ores scorings [09:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:00] 10Operations, 10Core Platform Team Backlog (Next), 10Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10MoritzMuehlenhoff) [09:13:55] (03PS1) 10Ema: cache_canary: stop using exp admission policy [puppet] - 10https://gerrit.wikimedia.org/r/476460 [09:14:11] (03PS2) 10Ema: cache_canary: stop using exp admission policy [puppet] - 10https://gerrit.wikimedia.org/r/476460 [09:15:11] (03CR) 10Ema: [C: 032] cache_canary: stop using exp admission policy [puppet] - 10https://gerrit.wikimedia.org/r/476460 (owner: 10Ema) [09:31:09] (03PS2) 10Ema: cache: stop using nhw admission policy [puppet] - 10https://gerrit.wikimedia.org/r/476311 (https://phabricator.wikimedia.org/T144187) [09:32:15] (03CR) 10Ema: [C: 032] cache: stop using nhw admission policy [puppet] - 10https://gerrit.wikimedia.org/r/476311 (https://phabricator.wikimedia.org/T144187) (owner: 10Ema) [09:32:32] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/476393 (https://phabricator.wikimedia.org/T210486) (owner: 10Cwhite) [09:33:23] (03CR) 10Filippo Giunchedi: [C: 031] swift: Fix checks on drive/filesystem titles to allow for labs ones [puppet] - 10https://gerrit.wikimedia.org/r/402758 (https://phabricator.wikimedia.org/T184236) (owner: 10Alex Monk) [09:34:03] (03PS6) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) [09:35:37] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [09:36:19] ACKNOWLEDGEMENT - Backup of s2 in 
eqiad on db1115 is CRITICAL: Backup for s2 at eqiad taken more than 8 days ago: Most recent backup 2018-11-20 23:04:07 Banyek backup failed because of recloning of the backup source host. It's fixed, now the backup is ongoing [09:39:01] 10Operations, 10ORES, 10Scoring-platform-team, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Backlog): Blubber should be able to make multi docker files per repo - https://phabricator.wikimedia.org/T210267 (10zeljkofilipin) [09:41:26] (03CR) 10Elukey: [C: 032] analytics_cluster::webserver: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/416742 (owner: 10Dzahn) [09:41:34] (03PS13) 10Elukey: analytics_cluster::webserver: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/416742 (owner: 10Dzahn) [09:41:44] 10Operations, 10ops-codfw: ms-be2047 rebooting itself - https://phabricator.wikimedia.org/T210697 (10fgiunchedi) [09:41:47] 10Operations, 10ops-codfw: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10fgiunchedi) [09:41:55] 10Operations, 10ops-codfw: ms-be2047 rebooting itself - https://phabricator.wikimedia.org/T210697 (10fgiunchedi) Indeed, faulty hardware :( [09:50:19] labsdb1010 maintenance in 10 minutes [09:53:19] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:53:41] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:53:47] (03CR) 10Alex Monk: [C: 031] "Preferable to realm branching." 
[puppet] - 10https://gerrit.wikimedia.org/r/476459 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [09:54:06] (03PS2) 10Jcrespo: admin: Add Ryan Steinberg and Joe Wass access to production cluster [puppet] - 10https://gerrit.wikimedia.org/r/476039 (https://phabricator.wikimedia.org/T209298) [09:54:29] (03PS7) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) [09:54:39] (03CR) 10jerkins-bot: [V: 04-1] admin: Add Ryan Steinberg and Joe Wass access to production cluster [puppet] - 10https://gerrit.wikimedia.org/r/476039 (https://phabricator.wikimedia.org/T209298) (owner: 10Jcrespo) [09:55:24] !log restarting prometheus-elasticsearch-exporter-9200 on all elastic cirrus nodes [09:55:26] (03PS3) 10Jcrespo: admin: Add Ryan Steinberg and Joe Wass access to production cluster [puppet] - 10https://gerrit.wikimedia.org/r/476039 (https://phabricator.wikimedia.org/T209298) [09:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:56] (03CR) 10Elukey: [C: 032] "No op on thorium as far as I can see, thanks Daniel!" [puppet] - 10https://gerrit.wikimedia.org/r/416742 (owner: 10Dzahn) [10:01:26] !log depooling labsdb1010 due of maintenance - T209517 [10:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:32] T209517: Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 [10:01:55] 10Operations, 10Core Platform Team Backlog (Next), 10Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10akosiaris) Does this mean he have a hard deadline of 2019-04-01 for completing the migrations? Or per the "I can backport security fixes for... 
[10:01:58] (03CR) 10Banyek: [C: 032] wiki replicas: depool labsdb1010 for upgrades [puppet] - 10https://gerrit.wikimedia.org/r/476412 (https://phabricator.wikimedia.org/T209517) (owner: 10Bstorm) [10:02:07] (03PS2) 10Banyek: wiki replicas: depool labsdb1010 for upgrades [puppet] - 10https://gerrit.wikimedia.org/r/476412 (https://phabricator.wikimedia.org/T209517) (owner: 10Bstorm) [10:02:11] (03CR) 10Banyek: [V: 032 C: 032] wiki replicas: depool labsdb1010 for upgrades [puppet] - 10https://gerrit.wikimedia.org/r/476412 (https://phabricator.wikimedia.org/T209517) (owner: 10Bstorm) [10:07:04] (03PS6) 10GTirloni: openstack: Move Keystone DB credentials to my.cnf file [puppet] - 10https://gerrit.wikimedia.org/r/476109 (https://phabricator.wikimedia.org/T210404) [10:10:26] (03CR) 10GTirloni: [C: 032] openstack: Move Keystone DB credentials to my.cnf file [puppet] - 10https://gerrit.wikimedia.org/r/476109 (https://phabricator.wikimedia.org/T210404) (owner: 10GTirloni) [10:16:28] !log T209626 icinga downtime labvirt1011 for 1 month to avoid bogus pages [10:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:34] T209626: Empty labvirt1010 and 1011 before their leases expire - https://phabricator.wikimedia.org/T209626 [10:17:17] !log remove zookeeper's crontabs from conf100[1-3] to fix cronspam [10:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:06] just got to the office, will start writing the incident report quickly [10:18:21] (03CR) 10Muehlenhoff: [C: 031] "Looks good, all the NDAs/sign-offs are in place and the user data is fine as well." 
[puppet] - 10https://gerrit.wikimedia.org/r/476039 (https://phabricator.wikimedia.org/T209298) (owner: 10Jcrespo) [10:19:20] ACKNOWLEDGEMENT - haproxy failover on dbproxy1011 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Banyek T209517 [10:20:55] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:21:13] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:31:40] (03PS26) 10Mathew.onipe: elasticsearch: cookbook for multi-cluster services rolling restart [cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) [10:32:01] (03CR) 10Gehel: [C: 04-1] "This is superseeded by I0f1578aacc181ede284cef045e66258264b143ad and can probably be abandoned." [puppet] - 10https://gerrit.wikimedia.org/r/475944 (https://phabricator.wikimedia.org/T210265) (owner: 10Mathew.onipe) [10:32:59] (03CR) 10Gehel: [C: 031] "LGTM, let's wait until the servers are racked to merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/475942 (https://phabricator.wikimedia.org/T210265) (owner: 10Mathew.onipe) [10:34:55] RECOVERY - Backup of s2 in eqiad on db1115 is OK: Backup for s2 at eqiad taken less than 8 days ago and larger than 10 GB: Last one 2018-11-29 09:32:11 from db1095.eqiad.wmnet:3312 (107 GB) [10:36:43] (03PS1) 10Marostegui: pc1009: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/476469 (https://phabricator.wikimedia.org/T208383) [10:40:12] (03PS1) 10DCausse: elasticsearch: add psi & omega in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/476471 (https://phabricator.wikimedia.org/T207918) [10:42:20] (03PS2) 10Filippo Giunchedi: hieradata: add kafka_shipper::kafka_brokers variable to deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/476029 [10:42:22] (03PS6) 10Filippo Giunchedi: rsyslog: add UDP localhost compatibility endpoint [puppet] - 10https://gerrit.wikimedia.org/r/475352 (https://phabricator.wikimedia.org/T205851) [10:42:24] (03PS1) 10Filippo Giunchedi: logstash: add new logging kafka consumer [puppet] - 10https://gerrit.wikimedia.org/r/476472 (https://phabricator.wikimedia.org/T205851) [10:42:26] (03PS1) 10Filippo Giunchedi: logstash: copy 'severity' into 'level' where needed [puppet] - 10https://gerrit.wikimedia.org/r/476473 (https://phabricator.wikimedia.org/T205851) [10:42:28] (03CR) 10Gehel: [C: 04-1] "Very minor comments inline, otherwise LGTM" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) (owner: 10Mathew.onipe) [10:45:22] 10Operations, 10DBA, 10Patch-For-Review, 10User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10Marostegui) [10:46:08] (03CR) 10Marostegui: [C: 032] pc1009: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/476469 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [10:51:39] PROBLEM - Router 
interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:51:41] 10Operations, 10Core Platform Team Backlog (Next), 10Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10MoritzMuehlenhoff) >>! In T210704#4784515, @akosiaris wrote: > Does this mean he have a hard deadline of 2019-04-01 for completing the migra... [10:51:55] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:56:09] (03PS7) 10Filippo Giunchedi: rsyslog: add UDP localhost compatibility endpoint [puppet] - 10https://gerrit.wikimedia.org/r/475352 (https://phabricator.wikimedia.org/T205851) [10:56:11] (03PS2) 10Filippo Giunchedi: logstash: add new logging kafka consumer [puppet] - 10https://gerrit.wikimedia.org/r/476472 (https://phabricator.wikimedia.org/T205851) [10:56:13] (03PS2) 10Filippo Giunchedi: logstash: copy 'severity' into 'level' where needed [puppet] - 10https://gerrit.wikimedia.org/r/476473 (https://phabricator.wikimedia.org/T205851) [10:56:14] 10Operations, 10Electron-PDFs, 10Proton, 10Epic, and 4 others: [EPIC] New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10phuedx) [11:08:18] 10Puppet, 10Phabricator: Local config file contains escape characters - https://phabricator.wikimedia.org/T103924 (10Aklapper) 05Open>03declined No reply. :( Please reopen this task when clarifying what is the actual problem here and where. (I also assume this is an upstream issue?) Thanks a lot! 
[11:24:16] 10Operations, 10CirrusSearch, 10Discovery-Search: Find an alternative to curl connection pooling available in HHVM - https://phabricator.wikimedia.org/T210717 (10dcausse) [11:24:44] 10Operations, 10CirrusSearch, 10Discovery-Search: Find an alternative to curl connection pooling available in HHVM - https://phabricator.wikimedia.org/T210717 (10dcausse) [11:24:54] 10Operations, 10Core Platform Team Backlog (Watching / External), 10HHVM, 10Patch-For-Review, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10dcausse) [11:27:29] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:27:53] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:36:04] (03Abandoned) 10Mathew.onipe: cirrus.yaml: add new elastic2037-elastic2054 to existing clusters [puppet] - 10https://gerrit.wikimedia.org/r/475944 (https://phabricator.wikimedia.org/T210265) (owner: 10Mathew.onipe) [11:39:53] 10Operations, 10Traffic: Varnish won't purge thumbnails of specific file - https://phabricator.wikimedia.org/T207615 (10Aklapper) Let's close as declined as noone can reproduce anymore? [12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181129T1200). [12:00:04] CFisch_WMDE and dcausse: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:22] o/ [12:00:25] \o/ [12:00:32] \o [12:00:45] (03PS1) 10Banyek: Revert "wiki replicas: depool labsdb1010 for upgrades" [puppet] - 10https://gerrit.wikimedia.org/r/476482 [12:01:00] !log repooling labsdb1010 after upgrades - T209517 [12:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:05] T209517: Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 [12:01:06] dcausse: go ahead with your patches, do you want to deploy CFisch_WMDE's patch, or should I do it? [12:01:08] (03PS1) 10Elukey: profile::hive::client: add support for kerberos to Beeline [puppet] - 10https://gerrit.wikimedia.org/r/476483 [12:01:10] (03PS1) 10Elukey: profile::hive::client: move beeline's erb from role to profile ns [puppet] - 10https://gerrit.wikimedia.org/r/476484 [12:01:28] zeljkof: ok deploying [12:01:51] (03CR) 10Banyek: [C: 032] Revert "wiki replicas: depool labsdb1010 for upgrades" [puppet] - 10https://gerrit.wikimedia.org/r/476482 (owner: 10Banyek) [12:02:03] (03PS2) 10Banyek: Revert "wiki replicas: depool labsdb1010 for upgrades" [puppet] - 10https://gerrit.wikimedia.org/r/476482 [12:02:06] (03CR) 10Banyek: [V: 032 C: 032] Revert "wiki replicas: depool labsdb1010 for upgrades" [puppet] - 10https://gerrit.wikimedia.org/r/476482 (owner: 10Banyek) [12:02:08] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475745 (https://phabricator.wikimedia.org/T198352) (owner: 10DCausse) [12:02:28] (03PS4) 10Jcrespo: admin: Add Ryan Steinberg and Joe Wass access to production cluster [puppet] - 10https://gerrit.wikimedia.org/r/476039 (https://phabricator.wikimedia.org/T209298) [12:03:25] (03Merged) 10jenkins-bot: [cirrus] Use normal config for labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475745 (https://phabricator.wikimedia.org/T198352) (owner: 10DCausse) [12:04:01] (03PS2) 10Elukey: 
profile::hive::client: add support for kerberos to Beeline [puppet] - 10https://gerrit.wikimedia.org/r/476483 [12:04:22] (03CR) 10Jcrespo: [C: 032] admin: Add Ryan Steinberg and Joe Wass access to production cluster [puppet] - 10https://gerrit.wikimedia.org/r/476039 (https://phabricator.wikimedia.org/T209298) (owner: 10Jcrespo) [12:05:29] (03CR) 10Elukey: [C: 032] profile::hive::client: add support for kerberos to Beeline [puppet] - 10https://gerrit.wikimedia.org/r/476483 (owner: 10Elukey) [12:05:49] (03PS3) 10Elukey: profile::hive::client: add support for kerberos to Beeline [puppet] - 10https://gerrit.wikimedia.org/r/476483 [12:05:52] (03CR) 10Elukey: [V: 032 C: 032] profile::hive::client: add support for kerberos to Beeline [puppet] - 10https://gerrit.wikimedia.org/r/476483 (owner: 10Elukey) [12:06:08] (03PS2) 10Elukey: profile::hive::client: move beeline's erb from role to profile ns [puppet] - 10https://gerrit.wikimedia.org/r/476484 [12:07:08] 10Operations, 10Core Platform Team Backlog (Watching / External), 10HHVM, 10Patch-For-Review, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Joe) [12:07:57] (03CR) 10Elukey: [C: 032] profile::hive::client: move beeline's erb from role to profile ns [puppet] - 10https://gerrit.wikimedia.org/r/476484 (owner: 10Elukey) [12:09:22] (03PS1) 10Jcrespo: Revert "admin: Add Ryan Steinberg and Joe Wass access to production cluster" [puppet] - 10https://gerrit.wikimedia.org/r/476485 [12:09:31] (03PS2) 10Jcrespo: Revert "admin: Add Ryan Steinberg and Joe Wass access to production cluster" [puppet] - 10https://gerrit.wikimedia.org/r/476485 [12:09:54] !log dcausse@deploy1001 Synchronized wmf-config/CirrusSearch-production.php: T210381: [cirrus] Use normal config for labswiki (duration: 00m 55s) [12:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:00] T210381: Update mw-config to use the psi&omega elastic clusters in codfw - 
https://phabricator.wikimedia.org/T210381 [12:10:15] (03CR) 10Jcrespo: [V: 032 C: 032] Revert "admin: Add Ryan Steinberg and Joe Wass access to production cluster" [puppet] - 10https://gerrit.wikimedia.org/r/476485 (owner: 10Jcrespo) [12:10:21] (03CR) 10Jcrespo: [V: 032 C: 032] "Failed" [puppet] - 10https://gerrit.wikimedia.org/r/476485 (owner: 10Jcrespo) [12:10:23] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475746 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [12:10:26] (03PS1) 10Ladsgroup: Revert "Revert "labs: Add mediainfo to federation config"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476486 [12:10:31] (03CR) 10Ladsgroup: [C: 032] Revert "Revert "labs: Add mediainfo to federation config"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476486 (owner: 10Ladsgroup) [12:10:39] (03CR) 10jenkins-bot: [cirrus] Use normal config for labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475745 (https://phabricator.wikimedia.org/T198352) (owner: 10DCausse) [12:11:28] (03Merged) 10jenkins-bot: [cirrus] multi-instance: add cirrussearch-big-indices.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475746 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [12:11:42] (03CR) 10jenkins-bot: [cirrus] multi-instance: add cirrussearch-big-indices.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475746 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [12:12:09] (03Merged) 10jenkins-bot: Revert "Revert "labs: Add mediainfo to federation config"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476486 (owner: 10Ladsgroup) [12:12:36] (03PS1) 10Jcrespo: Revert "Revert "admin: Add Ryan Steinberg and Joe Wass access to production cluster"" [puppet] - 10https://gerrit.wikimedia.org/r/476487 [12:13:45] !log dcausse@deploy1001 Synchronized dblists/cirrussearch-big-indices.dblist: T210381: [cirrus] multi-instance: add 
cirrussearch-big-indices.dblist (duration: 00m 53s) [12:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:49] PROBLEM - puppet last run on notebook1004 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): User[ryanmax],User[afandian2] [12:14:03] that is me, fixing [12:14:05] zeljkof: I'm done [12:14:13] (a puppet run would also work) [12:14:30] 10Operations, 10Traffic: Varnish won't purge thumbnails of specific file - https://phabricator.wikimedia.org/T207615 (10Gilles) 05Open>03declined Sure, 'till next time ;) [12:15:04] dcausse: great! want to deploy CFisch_WMDE's patch, or should I do it? :) [12:15:14] zeljkof: I can [12:15:27] dcausse: great! please do then :) [12:15:29] PROBLEM - puppet last run on people1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): User[ryanmax],User[afandian2] [12:15:38] (03PS2) 10Jcrespo: Revert "Revert "admin: Add Ryan Steinberg and Joe Wass access to production cluster"" [puppet] - 10https://gerrit.wikimedia.org/r/476487 (https://phabricator.wikimedia.org/T209298) [12:15:38] :-) [12:17:29] (03PS3) 10Jcrespo: Revert "Revert "admin: Add Ryan Steinberg and Joe Wass access to production cluster"" [puppet] - 10https://gerrit.wikimedia.org/r/476487 (https://phabricator.wikimedia.org/T209298) [12:17:37] PROBLEM - puppet last run on rutherfordium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. 
Failed resources (up to 3 shown): User[ryanmax],User[afandian2] [12:19:13] (03CR) 10Jcrespo: [C: 032] Revert "Revert "admin: Add Ryan Steinberg and Joe Wass access to production cluster"" [puppet] - 10https://gerrit.wikimedia.org/r/476487 (https://phabricator.wikimedia.org/T209298) (owner: 10Jcrespo) [12:21:53] 10Operations, 10Research-Programs, 10SRE-Access-Requests, 10Epic, 10Patch-For-Review: access to analytics-privatedata-users for @toddleroux, @Afandian, & @RyanSteinberg - https://phabricator.wikimedia.org/T209298 (10jcrespo) [12:23:41] (03CR) 10jenkins-bot: Revert "Revert "labs: Add mediainfo to federation config"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476486 (owner: 10Ladsgroup) [12:24:50] 10Puppet, 10Beta-Cluster-Infrastructure: [Cloud VPS alert] Puppet failure on deployment-logstash2.deployment-prep.eqiad.wmflabs - https://phabricator.wikimedia.org/T210718 (10MarcoAurelio) [12:26:17] CFisch_WMDE: jenkins is not happy :/ [12:26:27] dcausse: yeah ... these browser test [12:26:30] should we try again? [12:26:33] try to trigger it again plz [12:26:42] sure [12:26:45] thanks [12:27:38] 10Puppet, 10ORES, 10Scoring-platform-team (Current), 10User-Ladsgroup, 10Wikimedia-Incident: ORES services should bind to ores config files - https://phabricator.wikimedia.org/T210719 (10Ladsgroup) [12:29:48] (03PS1) 10Elukey: hive-site.xml: render hive.metastore.sasl.enabled only on metastore [puppet/cdh] - 10https://gerrit.wikimedia.org/r/476490 [12:29:50] 10Operations, 10Research-Programs, 10SRE-Access-Requests, 10Epic, 10Patch-For-Review: access to analytics-privatedata-users for @toddleroux, @Afandian, & @RyanSteinberg - https://phabricator.wikimedia.org/T209298 (10jcrespo) a:03toddleroux ` Notice: /Stage[main]/Admin/Admin::Hashuser[ryanmax]/Admin::Us... 
[12:30:02] 10Operations, 10Puppet, 10ORES, 10Scoring-platform-team, 10Wikimedia-Incident: Logrotate should restart services when more people are around - https://phabricator.wikimedia.org/T210720 (10Ladsgroup) [12:30:09] (03CR) 10Elukey: [V: 032 C: 032] hive-site.xml: render hive.metastore.sasl.enabled only on metastore [puppet/cdh] - 10https://gerrit.wikimedia.org/r/476490 (owner: 10Elukey) [12:30:47] !log run puppet on notebook1004, people1001, rutherfordium to fix failures [12:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:04] (03PS1) 10Elukey: Update cdh module to its latest sha [puppet] - 10https://gerrit.wikimedia.org/r/476491 [12:33:53] (03CR) 10Elukey: [C: 032] Update cdh module to its latest sha [puppet] - 10https://gerrit.wikimedia.org/r/476491 (owner: 10Elukey) [12:34:29] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:35:55] 10Operations, 10cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (10aborrero) [12:39:11] 10Operations, 10Release-Engineering-Team, 10Scoring-platform-team: Contact number of some WMDE staff should be avalible to SRE/RelEng - https://phabricator.wikimedia.org/T210721 (10Ladsgroup) [12:41:11] CFisch_WMDE: your change should be on mwdebug1002 is it possible for you to test? [12:41:19] RECOVERY - puppet last run on people1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:41:22] yes I will have a look [12:43:04] dcausse: Either my test is wrong or it's not there :-/ [12:43:13] CFisch_WMDE: looking [12:43:27] RECOVERY - puppet last run on rutherfordium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:44:47] CFisch_WMDE: I see Html::element on line 186 at mwdebug1002 for php-1.33.0-wmf.6 [12:45:09] CFisch_WMDE: are you testing a wiki that is on wmf.6 ? 
[12:45:24] Ahh yeah good point thanks ^^' [12:46:24] jouncebot: next [12:46:24] In 0 hour(s) and 13 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181129T1300) [12:46:55] dcausse: nice works, thanks [12:47:04] (03PS27) 10Mathew.onipe: elasticsearch: cookbook for multi-cluster services rolling restart [cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) [12:47:06] (03CR) 10Mathew.onipe: elasticsearch: cookbook for multi-cluster services rolling restart (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) (owner: 10Mathew.onipe) [12:48:43] !log dcausse@deploy1001 Synchronized php-1.33.0-wmf.6/extensions/TwoColConflict/: Fix unescaped HTML injected into conflict resolution interface (duration: 00m 53s) [12:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:01] CFisch_WMDE: it's live on all servers now [12:49:45] 10Puppet, 10Beta-Cluster-Infrastructure, 10Cloud-VPS, 10cloud-services-team: [Cloud VPS alert] Puppet failure on deployment-logstash2.deployment-prep.eqiad.wmflabs - https://phabricator.wikimedia.org/T210718 (10MarcoAurelio) [12:49:45] !log EU swat done [12:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:48] \o/ works, thanks [12:49:59] you're welcome! :) [12:54:55] PROBLEM - puppet last run on scb1004 is CRITICAL: CRITICAL: Puppet has 28 failures. Last run 3 minutes ago with 28 failures. Failed resources (up to 3 shown): Exec[absent_ensure_members],Exec[ops_ensure_members],Exec[wikidev_ensure_members],Exec[adm_ensure_members] [12:55:44] 10Operations, 10Release-Engineering-Team, 10Scoring-platform-team: Contact number of some WMDE staff should be avalible to SRE/RelEng - https://phabricator.wikimedia.org/T210721 (10WMDE-leszek) a:03WMDE-leszek I take it on me. 
I've briefly talked about this topic with @greg during Technical Conference. We'... [12:55:50] mm, checking that [12:55:51] 10Operations, 10Release-Engineering-Team, 10Scoring-platform-team: Contact number of some WMDE staff should be avalible to SRE/RelEng - https://phabricator.wikimedia.org/T210721 (10WMDE-leszek) p:05Triage>03High [12:56:33] 10Puppet, 10Beta-Cluster-Infrastructure, 10Cloud-VPS, 10cloud-services-team: [Cloud VPS alert] Puppet failure on deployment-logstash2.deployment-prep.eqiad.wmflabs - https://phabricator.wikimedia.org/T210718 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi Yes this has been fixed by me a few hours ago... [12:56:43] Could not evaluate: Cannot allocate memory - fork(2) [12:57:31] akosiaris mobrovac: we may have memory issues on scb1004 [12:57:44] sigh [12:57:45] looking [12:57:59] eventstreams maybe? [12:58:38] is that safe to restart? [12:58:57] yup it's eventstreams [12:59:04] weird, i can't restart it [12:59:09] don't have the rights [12:59:11] I can try [12:59:19] jynus: wait i'll do it from deploy1001 [12:59:25] so to depool it first [12:59:28] ok [12:59:33] to minimise impact [12:59:42] that is why I asked if it was safe, I don't know the service at all [12:59:47] thanks [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181129T1300) [13:00:05] RECOVERY - puppet last run on scb1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:00:06] you take care for now, standing by when you need me [13:00:29] !log mobrovac@deploy1001 Started restart [eventstreams/deploy@07033d4]: Restart ES on scb1004 due to possible memory leak (again) [13:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:57] ok, mem is back again [13:01:05] thnx jynus for pinging [13:01:30] I love how we got at least three ESes now, external storage / elastic search / event streams [13:01:39] lol [13:01:52] (03PS2) 
10Aklapper: Phab: Use our custom Priority field value in tooltip on Reports page [puppet] - 10https://gerrit.wikimedia.org/r/455271 (https://phabricator.wikimedia.org/T91428) [13:01:56] (03PS2) 10Aklapper: Phab: Clarify that spaces are not allowed in user account names [puppet] - 10https://gerrit.wikimedia.org/r/455265 (https://phabricator.wikimedia.org/T179126) [13:01:59] don't worry, nothing happens if we restart an es* host! [13:02:17] just all wikipedias go down, but nothing happens :-) [13:02:24] (03PS1) 10Marostegui: db-eqiad.php: Pool pc1009 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476495 (https://phabricator.wikimedia.org/T208383) [13:02:39] hehe [13:02:51] (03PS3) 10Aklapper: Order list of extensions by alphabet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455188 [13:03:05] in the past, all wikipedias went down and the server got reimaged and all data was lost [13:05:03] "one of those days" [13:05:22] they changed the board and the BIOS init was reset [13:05:26] !log Upgrade pc3 tendril topology - T208383 [13:05:29] defaulting to network boot [13:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:30] T208383: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 [13:05:38] jynus: Oh boy, I remember that day [13:05:51] godog: that is why we now require a puppet change to reimage a db [13:06:19] jynus: indeed, I remember that day too, not fun at all [13:06:35] I mean, we have 6 copies and backups [13:07:09] over 2 datacenters, I am overdramatizing it a bit, we only lost one host [13:07:29] but indeed es and elastic hosts have been confused in the past [13:07:50] jynus: can you give this a review? 
https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/476495/ [13:09:42] (03CR) 10Jcrespo: [C: 031] db-eqiad.php: Pool pc1009 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476495 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [13:09:52] ^but I will ask for a followup [13:10:02] what do you mean? [13:10:03] document the IPs on the key [13:10:06] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Pool pc1009 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476495 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [13:10:12] because it is now confusing [13:10:16] what do you mean? [13:10:36] # this should be something like 'pc1', 'pc2', 'pc3', but don't touch it! [13:10:41] aaah right [13:10:44] I will deploy a hotfix for Wikibase when things are settled [13:10:45] I was planning to create a ticket [13:10:47] :) [13:10:48] # sharding key bla bla bla [13:10:54] no, no need to fix it [13:10:58] just add a comment [13:11:08] jynus: I wanted to start to get some devs involved and I was planning to create a ticket to work out how to change those [13:11:10] because in 2 months we will not remember what that is [13:11:14] (03Merged) 10jenkins-bot: db-eqiad.php: Pool pc1009 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476495 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [13:11:15] yes, but that is an aside [13:11:49] I just want a comment without touching the code of what 10.64.0.12 is and why it shouldn't be touched [13:11:54] Ah sure :) [13:12:02] I can add that after the 256GB [13:12:13] however [13:12:24] or maybe you can create the ticket and note it there, too [13:12:27] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Pool pc1009 in pc3 - T208383 (duration: 00m 53s) [13:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:31] T208383: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - 
https://phabricator.wikimedia.org/T208383 [13:12:32] but without it people will remove it [13:12:39] 10Operations, 10monitoring: Address recurrent service check time out for "HP RAID" on swift backend hosts - https://phabricator.wikimedia.org/T210723 (10fgiunchedi) [13:12:51] You mean something like '10.64.0.12' => '10.64.48.174', # pc1010, D3 4.4TB 256GB # 10.64.0.12 is the key, but it should be pc1 eventually [13:12:52] D3: test - ignore - https://phabricator.wikimedia.org/D3 [13:12:54] something like that? [13:12:58] or at least document before [13:13:23] 'sharding function key' -> 'server ip address' [13:13:32] and then what you proposed [13:13:39] Ah I see what you mean [13:13:43] I will get a draft :) [13:13:49] with a WARNING, do not change T2234234 (ticket of why) [13:14:28] the exact thing doesn't matter, it is just a comment to make sure people don't change it unless they know what they are doing [13:14:43] e.g. rob because he sees an IP that is unused [13:14:50] !log rebooting certcentral2001 to pick up SSBD-enabled qemu/kernel update [13:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:12] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10aborrero) Question: what is the warranty status of this server? would it make sense to get a more complete replacement by HP? 
(not just some spare pieces like disk and raid controllers) [13:15:15] and yes, normally people would ask us, but our code should outlive us :-) [13:16:03] (03CR) 10jenkins-bot: db-eqiad.php: Pool pc1009 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476495 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [13:16:54] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10aborrero) [13:17:31] !log rebooting certcentral1001 to pick up SSBD-enabled qemu/kernel update [13:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:46] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1020 - https://phabricator.wikimedia.org/T194855 (10aborrero) [13:19:46] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) @aborrero Unfortunately it's not that simple. Once we take delivery of a server we then have to work through technical support. We may be at the point where th... 
[13:21:01] jynus: T210725 [13:21:03] T210725: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725 [13:22:04] just a warning "DONT CHANGE THESE IPS T210725" or something would be enough [13:22:21] (03PS1) 10Marostegui: db-eqiad, db-codfw.php: Clarify sharding keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476497 [13:22:22] jynus: ^ [13:23:19] (03PS2) 10Marostegui: db-eqiad, db-codfw.php: Clarify sharding keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476497 [13:23:23] (03CR) 10jerkins-bot: [V: 04-1] db-eqiad, db-codfw.php: Clarify sharding keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476497 (owner: 10Marostegui) [13:23:23] suggestion, put the do not change in caps and on the next line [13:23:31] PROBLEM - puppet last run on install2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/centralcerts/apt.rsa-2048.crt] [13:23:33] so it cannot be missed [13:24:00] right above the keys? [13:24:15] (yes, maybe a bit overboard, but look at 525+ [13:24:16] (03CR) 10jerkins-bot: [V: 04-1] db-eqiad, db-codfw.php: Clarify sharding keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476497 (owner: 10Marostegui) [13:24:30] something that looks scary :-) [13:24:33] haha [13:24:48] so people read the ticket first [13:26:26] (03CR) 10Hashar: [C: 031] "Given that is solely for deployment-prep , we can +2/merge this at any time. 
Just make sure to rebase the repo on deployment.eqiad.wmnet " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476228 (https://phabricator.wikimedia.org/T205851) (owner: 10Filippo Giunchedi) [13:26:42] (03PS3) 10Marostegui: db-eqiad, db-codfw.php: Clarify sharding keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476497 [13:26:50] I think this way is clear enough ^ [13:27:50] 10Operations, 10Puppet, 10ORES, 10Scoring-platform-team, 10Wikimedia-Incident: Logrotate should restart services when more people are around - https://phabricator.wikimedia.org/T210720 (10akosiaris) I am afraid we can't really change it. It's been at 06:25am (UTC in our case) forever and people expect th... [13:28:09] +1, marostegui [13:28:16] (03CR) 10jerkins-bot: [V: 04-1] db-eqiad, db-codfw.php: Clarify sharding keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476497 (owner: 10Marostegui) [13:28:31] jynus: thanks! [13:29:06] some tabs issue [13:29:12] CI complains [13:29:21] yeah, I don't understand why as on my vim they are tabs :| [13:29:44] (03PS4) 10Marostegui: db-eqiad, db-codfw.php: Clarify sharding keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476497 [13:30:51] (03CR) 10jerkins-bot: [V: 04-1] db-eqiad, db-codfw.php: Clarify sharding keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476497 (owner: 10Marostegui) [13:31:01] this makes no sense [13:33:02] (03PS5) 10Marostegui: db-eqiad, db-codfw.php: Clarify sharding keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476497 [13:34:36] All good now \o/ [13:37:18] (03CR) 10Marostegui: [C: 032] db-eqiad, db-codfw.php: Clarify sharding keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476497 (owner: 10Marostegui) [13:37:42] marostegui: let me know when you are done and I will deploy some wikibase hotfix :) thx! [13:37:59] hashar: Ah sure, will not take long as soon as CI merges it! 
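Note on the sharding keys discussed above: the reason a label such as 10.64.0.12 must stay put even after the server behind it changes can be shown with a small sketch. It assumes a rendezvous-style hash over the labels; the log does not say which sharding function MediaWiki actually uses, and every address below except the label 10.64.0.12 is a made-up placeholder.

```python
import hashlib


def pick_label(cache_key, shard_map):
    """Return the sharding key (dict key) that owns cache_key.

    Rendezvous-style hashing: each label is hashed together with the
    cache key and the highest digest wins. Only the labels feed the
    hash; the server addresses (dict values) never do.
    """
    def score(label):
        return hashlib.md5(f"{label}:{cache_key}".encode()).hexdigest()
    return max(shard_map, key=score)


def placement(shard_map, cache_keys):
    """Slot index chosen for every cache key, in order."""
    labels = list(shard_map)
    return [labels.index(pick_label(k, shard_map)) for k in cache_keys]


cache_keys = [f"page:{i}" for i in range(1000)]

# Hypothetical maps: label -> server address. Only the label
# 10.64.0.12 appears in the log; the rest are placeholders.
old = {"10.64.0.12": "192.0.2.1", "192.0.2.102": "192.0.2.2", "192.0.2.103": "192.0.2.3"}
# Same labels, new servers behind them (what pooling a new host does).
repointed = {"10.64.0.12": "192.0.2.11", "192.0.2.102": "192.0.2.12", "192.0.2.103": "192.0.2.13"}
# Labels renamed to something friendlier, servers unchanged.
renamed = {"pc1": "192.0.2.1", "pc2": "192.0.2.2", "pc3": "192.0.2.3"}

unchanged = placement(old, cache_keys) == placement(repointed, cache_keys)
moved = sum(a != b for a, b in zip(placement(old, cache_keys), placement(renamed, cache_keys)))
print("repointing keeps placement:", unchanged)
print("entries that move after renaming labels:", moved)
```

Under this assumed scheme, swapping the server behind a label leaves every cache entry in the same slot, while renaming the labels reshuffles entries across slots, which is why a loud warning comment above the keys is cheaper than a rename.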
[13:38:21] (03Merged) 10jenkins-bot: db-eqiad, db-codfw.php: Clarify sharding keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476497 (owner: 10Marostegui) [13:39:11] marostegui: no worries, take your time :) [13:39:29] the CI job, I will have to look at it but it seems most of the wait time is due to php codesniffer :\\\ [13:39:37] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Clarify parsercache keys section (duration: 00m 52s) [13:39:38] One more file and I am done [13:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:23] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::php: add opcache tuning for php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/476499 (https://phabricator.wikimedia.org/T206341) [13:40:25] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::php: tune php-fpm parameters [puppet] - 10https://gerrit.wikimedia.org/r/476500 (https://phabricator.wikimedia.org/T206341) [13:40:27] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::php: armonize settings with HHVM [puppet] - 10https://gerrit.wikimedia.org/r/476501 [13:40:29] (03PS1) 10Giuseppe Lavagetto: mediawiki: configure php-fpm logging [puppet] - 10https://gerrit.wikimedia.org/r/476502 [13:40:37] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Clarify parsercache keys section (duration: 00m 53s) [13:40:38] hashar: I am done! [13:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:40] 10Operations, 10monitoring: Address recurrent service check time out for "HP RAID" on swift backend hosts - https://phabricator.wikimedia.org/T210723 (10fgiunchedi) Note we've been here before in {T172921} and sadly the command check timeout can be changed only globally on the icinga side, not per-service. 
[13:41:52] cool [13:42:03] 10Operations, 10monitoring, 10User-fgiunchedi: Address recurrent service check time out for "HP RAID" on swift backend hosts - https://phabricator.wikimedia.org/T210723 (10fgiunchedi) [13:42:21] (03PS2) 10Filippo Giunchedi: LabsServices: ship logs locally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476228 (https://phabricator.wikimedia.org/T205851) [13:44:09] (03CR) 10Filippo Giunchedi: [C: 032] LabsServices: ship logs locally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476228 (https://phabricator.wikimedia.org/T205851) (owner: 10Filippo Giunchedi) [13:45:02] !log hashar@deploy1001 Synchronized php-1.33.0-wmf.6/extensions/Wikibase: feature flag for globe coordinator formatter using kartographer - T184933 T210617 (duration: 01m 18s) [13:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:08] T210617: BadMethodCallException on Wikidata item pages containing coordinates with non-Earth globes - https://phabricator.wikimedia.org/T210617 [13:45:08] T184933: Display map for geocoordinate statements - https://phabricator.wikimedia.org/T184933 [13:45:14] (03Merged) 10jenkins-bot: LabsServices: ship logs locally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476228 (https://phabricator.wikimedia.org/T205851) (owner: 10Filippo Giunchedi) [13:45:16] gre [13:46:15] (03PS1) 10Huji: Dissallow eliminators to block certain groups on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476503 (https://phabricator.wikimedia.org/T210642) [13:48:13] (03CR) 10Daimona Eaytoy: Dissallow eliminators to block certain groups on fawiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476503 (https://phabricator.wikimedia.org/T210642) (owner: 10Huji) [13:49:14] RECOVERY - puppet last run on install2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:49:35] (03CR) 10jenkins-bot: db-eqiad, db-codfw.php: Clarify sharding keys [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/476497 (owner: 10Marostegui) [13:49:37] (03CR) 10jenkins-bot: LabsServices: ship logs locally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476228 (https://phabricator.wikimedia.org/T205851) (owner: 10Filippo Giunchedi) [13:50:09] (03PS2) 10Huji: Dissallow eliminators to block certain groups on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476503 (https://phabricator.wikimedia.org/T210642) [13:50:51] (03PS8) 10Gehel: profile::maps::osm_master: change osmupdater and osmimporter auth method to peer [puppet] - 10https://gerrit.wikimedia.org/r/475093 (https://phabricator.wikimedia.org/T206639) (owner: 10Mathew.onipe) [13:51:25] (03PS1) 10Hashar: wikidatawiki to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476504 [13:57:13] (03CR) 10Gehel: [C: 032] profile::maps::osm_master: change osmupdater and osmimporter auth method to peer [puppet] - 10https://gerrit.wikimedia.org/r/475093 (https://phabricator.wikimedia.org/T206639) (owner: 10Mathew.onipe) [14:00:04] hashar: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - European version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181129T1400). [14:00:36] (03CR) 10Hashar: [C: 032] wikidatawiki to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476504 (owner: 10Hashar) [14:01:37] (03Merged) 10jenkins-bot: wikidatawiki to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476504 (owner: 10Hashar) [14:02:59] (03CR) 10jenkins-bot: wikidatawiki to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476504 (owner: 10Hashar) [14:03:19] godog: oyu have forgotten to rebase on deployment.eqiad.wmnet ! 
I have done it ) [14:03:28] !log uploaded nodejs 6.11.0~dfsg-1+wmf3 to apt.wikimedia.org/stretch-wikimedia (backporting the current security fixes) [14:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:16] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: wikidatawiki to 1.33.0-wmf.6 [14:05:29] (03PS1) 10Andrew Bogott: Horizon: move projects to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/476507 (https://phabricator.wikimedia.org/T204745) [14:05:50] jouncebot: next [14:05:51] In 2 hour(s) and 54 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181129T1700) [14:06:07] (03CR) 10Andrew Bogott: [C: 032] Horizon: move projects to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/476507 (https://phabricator.wikimedia.org/T204745) (owner: 10Andrew Bogott) [14:06:20] hashar@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [14:06:36] ;( [14:08:08] hashar: indeed, thank you! [14:09:13] 10Operations, 10Electron-PDFs, 10Proton, 10Epic, and 4 others: [EPIC] New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10pmiazga) The service is ready, the remaining thing is to increase the CPU count (T197862). I'll talk with services today about this task. There a... [14:09:53] (03CR) 10Daimona Eaytoy: [C: 031] Dissallow eliminators to block certain groups on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476503 (https://phabricator.wikimedia.org/T210642) (owner: 10Huji) [14:10:07] !log test stashbot [14:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:12] ... 
[14:10:40] (03PS1) 10Hashar: all wikis to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476509 [14:10:42] (03CR) 10Hashar: [C: 032] all wikis to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476509 (owner: 10Hashar) [14:10:58] (03CR) 10jerkins-bot: [V: 04-1] all wikis to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476509 (owner: 10Hashar) [14:11:05] (03CR) 10jerkins-bot: [V: 04-1] all wikis to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476509 (owner: 10Hashar) [14:12:35] (03PS2) 10Hashar: all wikis to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476509 [14:13:10] grr [14:13:17] I screwed up the update of wikidatawiki [14:13:37] (03CR) 10Hashar: [C: 032] all wikis to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476509 (owner: 10Hashar) [14:14:39] (03Merged) 10jenkins-bot: all wikis to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476509 (owner: 10Hashar) [14:15:23] (03CR) 10Gehel: [C: 031] "LGTM, waiting to see if Volans has a last comment before merging." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) (owner: 10Mathew.onipe) [14:16:16] (03CR) 10jenkins-bot: all wikis to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476509 (owner: 10Hashar) [14:16:49] (03PS3) 10Vgutierrez: gerrit: Switch between old LE puppetization and certcentral using hiera [puppet] - 10https://gerrit.wikimedia.org/r/476459 (https://phabricator.wikimedia.org/T207050) [14:16:51] (03PS1) 10Vgutierrez: certcentral: Mimick letsencrypt::cert::integrated key_group [puppet] - 10https://gerrit.wikimedia.org/r/476510 (https://phabricator.wikimedia.org/T207050) [14:17:17] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.33.0-wmf.6 [14:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:15] pfff [14:23:51] (03CR) 10Vgutierrez: "pcc shows the expected changes (0600 --> 0640) in existing certcentral clients: https://puppet-compiler.wmflabs.org/compiler1002/13780/" [puppet] - 10https://gerrit.wikimedia.org/r/476510 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [14:25:05] (03CR) 10Mathew.onipe: maps: remove osmupdater and osmimporter hiera passwords (032 comments) [labs/private] - 10https://gerrit.wikimedia.org/r/468631 (https://phabricator.wikimedia.org/T206639) (owner: 10Mathew.onipe) [14:26:14] (03PS1) 10Hashar: Revert "all wikis to 1.33.0-wmf.6" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476514 (https://phabricator.wikimedia.org/T206660) [14:26:34] (03CR) 10Hashar: [C: 032] Revert "all wikis to 1.33.0-wmf.6" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476514 (https://phabricator.wikimedia.org/T206660) (owner: 10Hashar) [14:27:29] (03PS2) 10Mathew.onipe: maps: remove osmupdater and osmimporter hiera passwords [labs/private] - 10https://gerrit.wikimedia.org/r/468631 (https://phabricator.wikimedia.org/T206639) [14:27:32] 10Operations, 10Wikimedia-Mailing-lists: Post hold because 
of "invalid headers" in wikimediacz-l - https://phabricator.wikimedia.org/T210223 (10herron) Hello, I notice that on WikimediaCZ-l within the "privacy options..." > "spam filtering" section in the list admin, below "legacy spam filtering" there is a `b... [14:27:35] (03Merged) 10jenkins-bot: Revert "all wikis to 1.33.0-wmf.6" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476514 (https://phabricator.wikimedia.org/T206660) (owner: 10Hashar) [14:29:59] !log hashar@deploy1001 scap failed: average error rate on 11/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details) [14:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:33] __main__.CheckServiceError: Generic connection error: HTTPConnectionPool(host='logstash1009.eqiad.wmnet', port=9200): Max retries exceeded with url: /logstash-*/_search (Caused by ReadTimeoutError("HTTPConnectionPool(host='logstash1009.eqiad.wmnet', port=9200): Read timed out. 
(read timeout=10)",)) [14:30:38] looks like logstash has some issue [14:31:40] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: Revert all wikis to 1.33.0-wmf.6 [14:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:58] will fill tasks for all the spam I got [14:35:45] (03CR) 10jenkins-bot: Revert "all wikis to 1.33.0-wmf.6" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476514 (https://phabricator.wikimedia.org/T206660) (owner: 10Hashar) [14:40:13] (03PS3) 10Alexandros Kosiaris: First draft of a graphoid helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/434475 [14:42:43] (03PS3) 10Filippo Giunchedi: hieradata: add kafka_shipper::kafka_brokers variable to deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/476029 [14:43:31] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: add kafka_shipper::kafka_brokers variable to deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/476029 (owner: 10Filippo Giunchedi) [14:45:15] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team, 10User-Smalyshev: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (10Andrew) So it sounds like you will need dedicated hardware t... [14:50:48] 10Operations, 10DBA, 10MediaWiki-Change-tagging, 10MW-1.33-notes (1.33.0-wmf.8; 2018-12-11), and 3 others: Migrate tag_summary usage to change_tag and drop the table - https://phabricator.wikimedia.org/T209525 (10Banyek) Do you need anything from our side this moment @Ladsgroup ? [14:56:13] 10Operations, 10DBA, 10MediaWiki-Change-tagging, 10MW-1.33-notes (1.33.0-wmf.8; 2018-12-11), and 3 others: Migrate tag_summary usage to change_tag and drop the table - https://phabricator.wikimedia.org/T209525 (10Ladsgroup) >>! In T209525#4785174, @Banyek wrote: > Do you need anything from our side this mo... 
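Note on the scap failure logged at 14:29:59: the message says the average error rate on the canaries increased by 10x. A minimal sketch of such a threshold check follows, purely as an illustration. The function name, the `floor` parameter, and the sample rates are all hypothetical; the log does not show how scap computes or compares the rates, and the later timeout against logstash suggests the real check also depends on that backend being reachable.

```python
def canary_check_failed(before, after, factor=10.0, floor=1.0):
    """Return True when the mean error rate after a deploy is at least
    `factor` times the mean rate before it.

    `floor` is an assumed minimum baseline so that a near-zero rate
    before the deploy does not make any single error look like a spike.
    """
    baseline = sum(before) / len(before)
    current = sum(after) / len(after)
    return current >= max(baseline, floor) * factor


# Hypothetical per-canary error rates (errors per minute) for 11 canaries.
quiet_before = [1.0] * 11
spike_after = [12.0] * 11
mild_after = [2.0] * 11

print(canary_check_failed(quiet_before, spike_after))
print(canary_check_failed(quiet_before, mild_after))
```

A check of this shape fails closed on a large jump, which matches the log's advice that overriding it requires an explicit --force.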
[14:57:40] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Decom/return labvirt1010 and 1011 - https://phabricator.wikimedia.org/T210735 (10Andrew) [14:58:46] (03PS1) 10Vgutierrez: certcentral: Provide TLS certificates for lists.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/476521 (https://phabricator.wikimedia.org/T207050) [14:59:28] (03PS1) 10Andrew Bogott: Move labvirt1010/1011 to role::spare [puppet] - 10https://gerrit.wikimedia.org/r/476522 (https://phabricator.wikimedia.org/T210735) [15:02:04] (03CR) 10Gehel: [V: 032 C: 032] maps: remove osmupdater and osmimporter hiera passwords [labs/private] - 10https://gerrit.wikimedia.org/r/468631 (https://phabricator.wikimedia.org/T206639) (owner: 10Mathew.onipe) [15:02:29] (03CR) 10Marostegui: "Can we get a puppet compiler run to make sure it is a noop on the existing hosts?" [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) (owner: 10Banyek) [15:03:51] (03CR) 10Andrew Bogott: [C: 032] Move labvirt1010/1011 to role::spare [puppet] - 10https://gerrit.wikimedia.org/r/476522 (https://phabricator.wikimedia.org/T210735) (owner: 10Andrew Bogott) [15:08:53] (03CR) 10Banyek: "> Can we get a puppet compiler run to make sure it is a noop on the" [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) (owner: 10Banyek) [15:12:37] PROBLEM - Ensure that passive node gets the certificates from the active node as expected on certcentral2001 is CRITICAL: FILE_AGE CRITICAL: /var/lib/certcentral/live_certs/.rsync.status is 7357 seconds old and 0 bytes [15:12:42] (03CR) 10Jcrespo: "Note this doesn't yet handle firewalling." 
[puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) (owner: 10Banyek) [15:13:21] PROBLEM - Ensure cert-sync script runs successfully in the active node on certcentral1001 is CRITICAL: FILE_AGE CRITICAL: /var/lib/certcentral/live_certs/.rsync.done is 7400 seconds old and 0 bytes [15:13:25] PROBLEM - Keyholder SSH agent on certcentral1001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [15:13:39] PROBLEM - Keyholder SSH agent on certcentral2001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [15:13:52] (03CR) 10Alex Monk: [C: 031] certcentral: Mimick letsencrypt::cert::integrated key_group [puppet] - 10https://gerrit.wikimedia.org/r/476510 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [15:13:59] PROBLEM - Memory correctable errors -EDAC- on wtp2020 is CRITICAL: 4.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1var-server=wtp2020var-datasource=codfw%2520prometheus%252Fops [15:14:08] right... [15:14:14] OMW :) [15:14:25] PROBLEM - puppet last run on certcentral1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/certcentral-certs-sync] [15:15:18] 10Operations, 10DBA, 10MediaWiki-Change-tagging, 10MW-1.33-notes (1.33.0-wmf.8; 2018-12-11), and 3 others: Migrate tag_summary usage to change_tag and drop the table - https://phabricator.wikimedia.org/T209525 (10Marostegui) Which schema change? [15:15:33] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): Decom/return labvirt1010 and 1011 - https://phabricator.wikimedia.org/T210735 (10Andrew) a:03RobH [15:15:45] RECOVERY - Keyholder SSH agent on certcentral1001 is OK: OK: Keyholder is armed with all configured keys. 
[15:16:54] RECOVERY - Ensure cert-sync script runs successfully in the active node on certcentral1001 is OK: FILE_AGE OK: /var/lib/certcentral/live_certs/.rsync.done is 19 seconds old and 0 bytes [15:17:16] side effect of restarting certcentral nodes :) [15:17:41] 10Operations, 10DBA, 10MediaWiki-Change-tagging, 10MW-1.33-notes (1.33.0-wmf.8; 2018-12-11), and 3 others: Migrate tag_summary usage to change_tag and drop the table - https://phabricator.wikimedia.org/T209525 (10Ladsgroup) >>! In T209525#4785305, @Marostegui wrote: > Which schema change? `DROP TABLE tag... [15:18:13] (03CR) 10Vgutierrez: [C: 032] certcentral: Mimick letsencrypt::cert::integrated key_group [puppet] - 10https://gerrit.wikimedia.org/r/476510 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [15:18:16] !log $WHO Running Wikibase populateSitesTable.php on eswiktionary for T210732 [15:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:20] T210732: wiktionary: /rpc/RunSingleJob.php CannotCreateActorException from line 2540 of /srv/mediawiki/php-1.33.0-wmf.6/includes/user/User.php: Cannot create an actor for a usable name that is not an existing user - https://phabricator.wikimedia.org/T210732 [15:18:44] 10Operations, 10DBA, 10MediaWiki-Change-tagging, 10MW-1.33-notes (1.33.0-wmf.8; 2018-12-11), and 3 others: Migrate tag_summary usage to change_tag and drop the table - https://phabricator.wikimedia.org/T209525 (10Marostegui) >>! In T209525#4785310, @Ladsgroup wrote: >>>! In T209525#4785305, @Marostegui wro... 
[15:19:03] (03PS2) 10Vgutierrez: certcentral: Mimick letsencrypt::cert::integrated key_group [puppet] - 10https://gerrit.wikimedia.org/r/476510 (https://phabricator.wikimedia.org/T207050) [15:19:05] (03PS2) 10Vgutierrez: certcentral: Provide TLS certificates for lists.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/476521 (https://phabricator.wikimedia.org/T207050) [15:19:07] (03PS4) 10Vgutierrez: gerrit: Switch between old LE puppetization and certcentral using hiera [puppet] - 10https://gerrit.wikimedia.org/r/476459 (https://phabricator.wikimedia.org/T207050) [15:19:26] RECOVERY - puppet last run on certcentral1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:19:37] (03PS2) 10Gehel: elasticsearch: add psi & omega in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/476471 (https://phabricator.wikimedia.org/T207918) (owner: 10DCausse) [15:22:11] (03CR) 10BBlack: [C: 031] certcentral: Mimick letsencrypt::cert::integrated key_group [puppet] - 10https://gerrit.wikimedia.org/r/476510 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [15:22:48] (03PS3) 10Vgutierrez: certcentral: Mimick letsencrypt::cert::integrated key_group [puppet] - 10https://gerrit.wikimedia.org/r/476510 (https://phabricator.wikimedia.org/T207050) [15:23:14] RECOVERY - Ensure that passive node gets the certificates from the active node as expected on certcentral2001 is OK: FILE_AGE OK: /var/lib/certcentral/live_certs/.rsync.status is 400 seconds old and 0 bytes [15:24:05] \!log anomie@mwmaint1002 Running cleanupUsersWithNoId.php on eswiktionary recentchanges for T210732 [15:24:05] T210732: wiktionary: /rpc/RunSingleJob.php CannotCreateActorException from line 2540 of /srv/mediawiki/php-1.33.0-wmf.6/includes/user/User.php: Cannot create an actor for a usable name that is not an existing user - https://phabricator.wikimedia.org/T210732 [15:25:02] (03CR) 10Gehel: [C: 032] elasticsearch: add psi & omega in eqiad [puppet] - 
10https://gerrit.wikimedia.org/r/476471 (https://phabricator.wikimedia.org/T207918) (owner: 10DCausse) [15:25:04] RECOVERY - Keyholder SSH agent on certcentral2001 is OK: OK: Keyholder is armed with all configured keys. [15:25:10] (03PS3) 10Gehel: elasticsearch: add psi & omega in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/476471 (https://phabricator.wikimedia.org/T207918) (owner: 10DCausse) [15:26:35] (03CR) 10Vgutierrez: [C: 032] certcentral: Provide TLS certificates for lists.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/476521 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [15:26:43] (03PS3) 10Vgutierrez: certcentral: Provide TLS certificates for lists.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/476521 (https://phabricator.wikimedia.org/T207050) [15:27:31] 10Operations, 10DBA, 10Patch-For-Review, 10User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10Marostegui) >>! In T208383#4784872, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations),... [15:27:39] 10Operations, 10DBA, 10Patch-For-Review, 10User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10Marostegui) [15:28:54] \!log anomie@mwmaint1002 Running Wikibase/populateSitesTable.php and cleanupUsersWithNoId.php on several other wiktionaries for T210732 [15:29:29] that's weird [15:29:37] why is logmsgbot escaping its !log messages? 
[15:29:39] anomie, ^ [15:29:56] !log activating multiple elasticsearch instances on cirrus / eqiad - T207918 [15:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:00] T207918: Refactor current code base to support multiple elasticsearch instances/multiple elasticsearch clusters - https://phabricator.wikimedia.org/T207918 [15:30:07] gehel: thanks!^ [15:30:21] dcausse: wait until it actually works to thank me :) [15:30:27] yes sure :) [15:31:19] 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Banyek) I'll check what are the needs for achieving these goals in the DBA perspective [15:31:58] Krenair: ... It's me trying to use the logmsg command on mwmaint1002, and weirdness with bash wanting to treat "!log" as an event but apparently at the same time including the backslash when I escape it. [15:32:07] s/logmsg/dologmsg/ [15:32:14] (03CR) 10Imarlier: profile::mediawiki::php: add opcache tuning for php-fpm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/476499 (https://phabricator.wikimedia.org/T206341) (owner: 10Giuseppe Lavagetto) [15:32:23] ooh [15:32:25] right [15:32:35] because that part is not hardcoded in logmsgbot [15:32:42] !log anomie@mwmaint1002 Running Wikibase/populateSitesTable.php and cleanupUsersWithNoId.php on several other wiktionaries for T210732 [15:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:46] T210732: wiktionary: /rpc/RunSingleJob.php CannotCreateActorException from line 2540 of /srv/mediawiki/php-1.33.0-wmf.6/includes/user/User.php: Cannot create an actor for a usable name that is not an existing user - https://phabricator.wikimedia.org/T210732 [15:32:52] I filled a few blocking tasks (they might not all be blockers though) [15:32:53] There, that time it worked. 
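Note on the escaped \!log entries above: in an interactive bash session, `!` triggers history expansion, and a backslash placed before it inside double quotes is kept in the argument, which is one common way a literal backslash ends up in the message. Single quotes avoid both problems. The sketch below builds such a command line from Python; only the command name dologmsg comes from the log, and passing the whole message as a single argument is an assumption.

```python
import shlex


def dologmsg_command(message):
    """Build a shell command line that passes `message` as one literal
    argument. shlex.quote wraps it in single quotes, inside which bash
    performs no history expansion and needs no backslash before '!'.
    """
    # Assumption: dologmsg takes the message as its only argument.
    return "dologmsg " + shlex.quote(message)


cmd = dologmsg_command(
    "!log anomie@mwmaint1002 Running cleanupUsersWithNoId.php on eswiktionary"
)
print(cmd)
```

Typing the message in single quotes by hand has the same effect as the generated line.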
[15:32:58] anomie: thanks for the patches :) [15:33:16] I have to head back home for car maintenance, will be back in a couple hours maybe and catch up later this evening [15:33:43] hashar: That one was no patches, apparently just the need to run populateSitesTable.php when enabling Wikibase on wikis never made it into the right documentation. [15:34:01] anomie: ohhhh so that sounds like an easy fix, doesn't it? :) [15:34:27] I will redo the all group deployment later during the US train window [15:34:56] hashar: Should be fixed already with the maintenance scripts I just ran. [15:34:58] (03CR) 10DCausse: [C: 031] elasticsearch: configure LVS endpoint for new codfw clusters [puppet] - 10https://gerrit.wikimedia.org/r/475753 (https://phabricator.wikimedia.org/T207195) (owner: 10Gehel) [15:35:19] !log T196507 downtimed and powercycled cloudvirt1019 [15:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:23] T196507: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 [15:36:34] anomie: excellent. Thank you :) I will catch up later this evening [15:52:58] 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10bmansurov) @Banyek thanks for helping. While we port repositories to Gerrit, [[ https://github.com/kodchi/research-article-recommender-deploy | here ]]... 
[15:54:35] !log shutting down ms-be2047 for maintenance [15:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:44] (03PS2) 10Cwhite: hiera: add cluster definition to recursor role [puppet] - 10https://gerrit.wikimedia.org/r/476393 (https://phabricator.wikimedia.org/T210486) [16:04:23] 10Operations, 10ops-eqiad, 10Traffic, 10netops: lvs1006 down - https://phabricator.wikimedia.org/T210683 (10BBlack) p:05Normal>03High [16:04:55] 10Operations, 10Icinga, 10Scoring-platform-team: Add ahalfaker to ORES-related icinga contacts - https://phabricator.wikimedia.org/T210742 (10Dzahn) [16:06:34] (03PS3) 10Cwhite: hiera: add cluster definition to recursor role [puppet] - 10https://gerrit.wikimedia.org/r/476393 (https://phabricator.wikimedia.org/T210486) [16:08:12] 10Operations, 10ops-codfw: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10Papaul) Please replace the remaining hardware sent. I found the hardware below out of date. IDRAC at 3.21.21.21 CPLD at 1.4.9 BIOS at 1.0.1 Suggested action plan: 1. Clear System Event... [16:08:54] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10GTirloni) If the batter is installed and, as the HPE advisories suggest, the firmwares have been updated _and_ we have many other servers with this controller that are work... 
[16:09:13] (03PS3) 10Jcrespo: admin: Add jgleeson access to production cluster [puppet] - 10https://gerrit.wikimedia.org/r/476004 (https://phabricator.wikimedia.org/T208432) [16:09:46] (03CR) 10jerkins-bot: [V: 04-1] admin: Add jgleeson access to production cluster [puppet] - 10https://gerrit.wikimedia.org/r/476004 (https://phabricator.wikimedia.org/T208432) (owner: 10Jcrespo) [16:10:57] !log anomie@mwmaint1002 Running Wikibase/populateSitesTable.php and cleanupUsersWithNoId.php on more wiktionaries, incubatorwiki, and sourceswiki for T210732 [16:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:01] T210732: wiktionary: /rpc/RunSingleJob.php CannotCreateActorException from line 2540 of /srv/mediawiki/php-1.33.0-wmf.6/includes/user/User.php: Cannot create an actor for a usable name that is not an existing user - https://phabricator.wikimedia.org/T210732 [16:12:22] users with no id? who's that even possible? :P [16:17:44] Platonides: It used to be possible for imports and cross-wiki things like Wikidata's recentchanges entries to attribute something to a named user without that user existing locally. [16:18:02] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10GTirloni) @robh @Cmjohnson Can we get a technician from HP on site with various parts (cards, batteries, etc) to try and fix this? 
[16:18:09] (03PS4) 10Jcrespo: admin: Add jgleeson access to production cluster [puppet] - 10https://gerrit.wikimedia.org/r/476004 (https://phabricator.wikimedia.org/T208432) [16:21:16] that's true [16:21:36] didn't think on that [16:21:44] although I don't see why it would be a problem :P [16:22:39] RECOVERY - Host lvs1006 is UP: PING WARNING - Packet loss = 86%, RTA = 0.33 ms [16:24:19] 10Operations, 10ops-eqiad, 10Traffic, 10netops: lvs1006 down - https://phabricator.wikimedia.org/T210683 (10Cmjohnson) @bblack @ayounsi sfp-t was bad, replaced and the link is up [16:27:44] 10Operations, 10ops-eqiad, 10Traffic, 10netops: lvs1006 down - https://phabricator.wikimedia.org/T210683 (10Cmjohnson) 05Open>03Resolved [16:28:38] 10Operations, 10ops-codfw: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10Papaul) Did all the dell engineer recommended above. Waiting to proceed to step 10 . [16:31:05] PROBLEM - puppet last run on db1117 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:38:54] (03PS3) 10Herron: rsyslog:input:file add multiline handling and ship gerrit logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/475840 (https://phabricator.wikimedia.org/T141324) [16:39:18] (03CR) 10jerkins-bot: [V: 04-1] rsyslog:input:file add multiline handling and ship gerrit logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/475840 (https://phabricator.wikimedia.org/T141324) (owner: 10Herron) [16:40:14] 10Puppet, 10Beta-Cluster-Infrastructure, 10Cloud-VPS, 10cloud-services-team: [Cloud VPS alert] Puppet failure on deployment-logstash2.deployment-prep.eqiad.wmflabs - https://phabricator.wikimedia.org/T210718 (10MarcoAurelio) Thanks :) [16:40:19] (03PS4) 10Herron: rsyslog:input:file add multiline handling and ship gerrit logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/475840 (https://phabricator.wikimedia.org/T141324) [16:40:42] (03CR) 10jerkins-bot: [V: 04-1] rsyslog:input:file add multiline handling and ship gerrit logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/475840 (https://phabricator.wikimedia.org/T141324) (owner: 10Herron) [16:43:26] (03PS5) 10Herron: rsyslog:input:file add multiline handling and ship gerrit logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/475840 (https://phabricator.wikimedia.org/T141324) [16:43:39] 10Operations, 10procurement, 10Discovery-Search (Current work): Setup elasticsearch on new codfw servers - https://phabricator.wikimedia.org/T210265 (10RobH) [16:43:47] 10Operations, 10Discovery-Search (Current work): Setup elasticsearch on new codfw servers - https://phabricator.wikimedia.org/T210265 (10RobH) [16:44:17] 10Operations, 10Discovery-Search (Current work): Setup elasticsearch on new codfw servers - https://phabricator.wikimedia.org/T210265 (10RobH) I went ahead and moved this out of S4 (as its not procurement), back into S1, and removed the #procurement project. 
[16:46:52] (03CR) 10Paladox: [C: 031] gerrit: Switch between old LE puppetization and certcentral using hiera [puppet] - 10https://gerrit.wikimedia.org/r/476459 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [16:47:06] (03Abandoned) 10Mathew.onipe: base::monitoring::host: added icinga prometheus check for network drops [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) (owner: 10Mathew.onipe) [16:47:43] (03Abandoned) 10Mathew.onipe: Add elasticsearch [cookbooks] - 10https://gerrit.wikimedia.org/r/462514 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [16:48:23] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1002/13782/" [puppet] - 10https://gerrit.wikimedia.org/r/475840 (https://phabricator.wikimedia.org/T141324) (owner: 10Herron) [16:49:49] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Onboard at least 10 new non-sensitive log producers to the logging pipeline - https://phabricator.wikimedia.org/T205852 (10herron) [16:49:56] 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Banyek) as we were talking with @bmansurov I learned that we need to keep old data after new import is not considered working. my recommendation is to... 
[16:50:17] (03PS6) 10Herron: rsyslog:input:file add multiline handling and ship gerrit logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/475840 (https://phabricator.wikimedia.org/T141324) [16:50:25] I'd like to ask your opinion about https://phabricator.wikimedia.org/T208622#4785750 [16:50:30] tomorrow [16:50:33] today I leave [16:50:46] bye [16:51:53] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): Decom/return labvirt1010 and 1011 - https://phabricator.wikimedia.org/T210735 (10RobH) [16:53:01] (nothere) [16:57:44] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Decom/return labvirt1010 and 1011 - https://phabricator.wikimedia.org/T210735 (10RobH) [16:58:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): decommission (lease return) labvirt101[01].eqiad.wmnet - https://phabricator.wikimedia.org/T210735 (10RobH) p:05Triage>03High [17:00:04] godog and _joe_: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181129T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:02:07] !log decom of labvirt101[01] continuing [17:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:09] RECOVERY - puppet last run on db1117 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:02:13] they shouldnt echo, but just in case... [17:03:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): decommission (lease return) labvirt101[01].eqiad.wmnet - https://phabricator.wikimedia.org/T210735 (10RobH) Switch ports on asw2-b-eqiad: ` robh@asw2-b-eqiad> show interfaces descriptions | grep labvirt1010 ge-3/0/14 up up l... 
[17:05:41] (03PS1) 10Jcrespo: admin: Add addshore to graphite-admins; allow _grahite commands [puppet] - 10https://gerrit.wikimedia.org/r/476558 (https://phabricator.wikimedia.org/T208750) [17:05:45] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): decommission (lease return) labvirt101[01].eqiad.wmnet - https://phabricator.wikimedia.org/T210735 (10RobH) [17:07:03] (03CR) 10Jcrespo: "This is a first version without even looking at the puppet classes for the machine, I need to properly review it, too." [puppet] - 10https://gerrit.wikimedia.org/r/476558 (https://phabricator.wikimedia.org/T208750) (owner: 10Jcrespo) [17:09:41] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): decommission (lease return) labvirt101[01].eqiad.wmnet - https://phabricator.wikimedia.org/T210735 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for labvirt1010.eqiad.wmnet and performed the following actions: - Revoke... [17:09:43] (03PS1) 10Faidon Liambotis: openstack: make ::neutron::dmz_cidr an array [puppet] - 10https://gerrit.wikimedia.org/r/476567 (https://phabricator.wikimedia.org/T210754) [17:10:04] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): decommission (lease return) labvirt101[01].eqiad.wmnet - https://phabricator.wikimedia.org/T210735 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for labvirt1011.eqiad.wmnet and performed the following actions: - Revoke... 
[17:10:40] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): decommission (lease return) labvirt101[01].eqiad.wmnet - https://phabricator.wikimedia.org/T210735 (10RobH) [17:10:50] (03CR) 10jerkins-bot: [V: 04-1] openstack: make ::neutron::dmz_cidr an array [puppet] - 10https://gerrit.wikimedia.org/r/476567 (https://phabricator.wikimedia.org/T210754) (owner: 10Faidon Liambotis) [17:14:05] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team, 10User-Smalyshev: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (10Smalyshev) > In the meantime, can I delete t206636-3? Yes. [17:14:50] (03CR) 10Filippo Giunchedi: [C: 031] "Mostly nits, LGTM otherwise" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/475840 (https://phabricator.wikimedia.org/T141324) (owner: 10Herron) [17:15:10] (03PS1) 10RobH: removing references to decom servers labvirt101[01] [puppet] - 10https://gerrit.wikimedia.org/r/476570 (https://phabricator.wikimedia.org/T210735) [17:18:18] (03CR) 10RobH: [C: 032] removing references to decom servers labvirt101[01] [puppet] - 10https://gerrit.wikimedia.org/r/476570 (https://phabricator.wikimedia.org/T210735) (owner: 10RobH) [17:18:53] (03PS16) 10DCausse: [cirrus] Add temp clusters but still write to the old ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475750 (https://phabricator.wikimedia.org/T210381) [17:18:55] (03PS6) 10DCausse: [cirrus] Start writing to psi & omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381) [17:18:57] (03PS6) 10DCausse: [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) [17:18:59] (03PS8) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 
(https://phabricator.wikimedia.org/T210381) [17:19:15] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Add temp clusters but still write to the old ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475750 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [17:19:26] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): decommission (lease return) labvirt101[01].eqiad.wmnet - https://phabricator.wikimedia.org/T210735 (10RobH) [17:19:32] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Start writing to psi & omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [17:19:47] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [17:20:07] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [17:20:14] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): decommission (lease return) labvirt101[01].eqiad.wmnet - https://phabricator.wikimedia.org/T210735 (10RobH) a:05RobH>03Cmjohnson Ok, these are ready for @cmjohnson to do the SSD smartctl secure erase on these systems. As these are le... 
[17:20:23] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10cloud-services-team (Kanban): decommission (lease return) labvirt101[01].eqiad.wmnet - https://phabricator.wikimedia.org/T210735 (10RobH) [17:28:30] (03CR) 10Herron: "bunch of nitpicks" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/475352 (https://phabricator.wikimedia.org/T205851) (owner: 10Filippo Giunchedi) [17:32:35] (03CR) 10Elukey: "Added Mark/Faidon to see if this can be merged now as opposed to wait for the SRE meeting, since basically the team already approved what " [puppet] - 10https://gerrit.wikimedia.org/r/475984 (owner: 10Elukey) [17:33:22] (03CR) 10Herron: [C: 04-2] "> Swift produces a significant amount of logs (~1-4G/day compressed)" [puppet] - 10https://gerrit.wikimedia.org/r/475898 (https://phabricator.wikimedia.org/T63780) (owner: 10Herron) [17:34:51] !log anomie@deploy1001 Synchronized php-1.33.0-wmf.6/includes/revisiondelete/RevisionDeleteUser.php: Fix RevisionDeleteUser rev_actor query for MySQL (T210628) (duration: 00m 53s) [17:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:46] 10Operations, 10ORES, 10vm-requests, 10Scoring-platform-team (Current): New node request: oresrdb[12]003 - https://phabricator.wikimedia.org/T210582 (10akosiaris) [17:41:55] !log anomie@deploy1001 Synchronized php-1.33.0-wmf.6/includes/revisiondelete/RevisionDeleteUser.php: Fix RevisionDeleteUser rev_actor query for MySQL, for real this time (T210628) (duration: 00m 53s) [17:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:00] !log mforns@deploy1001 Started deploy [analytics/refinery@40b1972]: deploying refinery to refinery-source version v0.0.81 [17:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:23] Krenair: do i remember correctly you once had a script that parsed the admin.yaml [17:44:50] where you could easily do stuff like "are all members of group A also in group 
B" from the yaml [17:48:09] 10Operations, 10DBA, 10Patch-For-Review: Audit "misc" cluster hosts - https://phabricator.wikimedia.org/T210486 (10jcrespo) Adding DBA for the few db hosts that shouldn't be there, remove the tag when those are fixed: * New pc* hosts * New dbstore* hosts * dbmonitor (unsure of that one, that is most likely... [17:48:13] 10Operations, 10Proton, 10Services (doing): Increase the CPU count for proton[12]00[12] - https://phabricator.wikimedia.org/T197862 (10pmiazga) [17:48:16] 10Operations, 10Electron-PDFs, 10Proton, 10Epic, and 4 others: [EPIC] New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10pmiazga) [17:49:00] !log mforns@deploy1001 Finished deploy [analytics/refinery@40b1972]: deploying refinery to refinery-source version v0.0.81 (duration: 06m 01s) [17:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:41] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:49:43] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb={LIST,PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:54:23] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:54:25] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:55:45] !log remove test netbox user from cr3-ulsfo - T205898 [17:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:48] T205898: Netbox: explore NAPALM integration - https://phabricator.wikimedia.org/T205898 [17:56:15] (03PS7) 10Herron: rsyslog: input::file add multiline handling & ship gerrit logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/475840 (https://phabricator.wikimedia.org/T141324) [17:56:17] 10Operations, 10DBA, 10Patch-For-Review: Audit "misc" cluster hosts - https://phabricator.wikimedia.org/T210486 (10Marostegui) I can fix `regex.yaml` to add the new parsercache there, but the dbstore appearing on that list do not exist: dbstore1003 and dbstore1005 [17:56:27] (03CR) 10Herron: rsyslog: input::file add multiline handling & ship gerrit logs to ELK (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/475840 (https://phabricator.wikimedia.org/T141324) (owner: 10Herron) [17:57:35] (03PS8) 10Herron: rsyslog: input::file add multiline handling & ship gerrit logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/475840 (https://phabricator.wikimedia.org/T141324) [17:57:57] 10Operations, 10SRE-Access-Requests: Requesting access to "stat1007" for "researchers" group - https://phabricator.wikimedia.org/T210757 (10bmansurov) [17:58:45] (03CR) 10Herron: [C: 032] rsyslog: input::file add multiline handling & ship gerrit logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/475840 (https://phabricator.wikimedia.org/T141324) (owner: 10Herron) [17:58:50] (03PS1) 10Ayounsi: Revert "Netbox, set the napalm_username variable and matching keyholder" [puppet] - 10https://gerrit.wikimedia.org/r/476581 (https://phabricator.wikimedia.org/T205898) [17:59:21] (03CR) 10jerkins-bot: [V: 04-1] Revert "Netbox, set the napalm_username variable and matching keyholder" [puppet] - 10https://gerrit.wikimedia.org/r/476581 (https://phabricator.wikimedia.org/T205898) (owner: 
10Ayounsi) [18:00:01] 10Operations, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps - https://phabricator.wikimedia.org/T210757 (10bmansurov) [18:00:04] cscott, arlolra, subbu, halfak, and Amir1: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181129T1800). [18:05:39] (03Abandoned) 10Ayounsi: Revert "Netbox, set the napalm_username variable and matching keyholder" [puppet] - 10https://gerrit.wikimedia.org/r/476581 (https://phabricator.wikimedia.org/T205898) (owner: 10Ayounsi) [18:05:41] (03CR) 10DCausse: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475750 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [18:05:58] (03CR) 10Jgleeson: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/476004 (https://phabricator.wikimedia.org/T208432) (owner: 10Jcrespo) [18:06:04] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Add temp clusters but still write to the old ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475750 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [18:07:07] (03PS1) 10Ayounsi: Revert "Netbox, set the napalm_username variable and matching keyholder" [puppet] - 10https://gerrit.wikimedia.org/r/476583 (https://phabricator.wikimedia.org/T205898) [18:07:29] 10Operations, 10ops-eqiad: eqiad pdu audit - https://phabricator.wikimedia.org/T210760 (10RobH) p:05Triage>03Normal [18:07:40] (03PS1) 10Ayounsi: Revert "Add fake ssh keys for netbox user" [labs/private] - 10https://gerrit.wikimedia.org/r/476584 (https://phabricator.wikimedia.org/T205898) [18:08:44] (03CR) 10Ayounsi: [C: 032] Revert "Netbox, set the napalm_username variable and matching keyholder" [puppet] - 10https://gerrit.wikimedia.org/r/476583 (https://phabricator.wikimedia.org/T205898) (owner: 10Ayounsi) [18:09:03] PROBLEM - 
puppet last run on phab2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:09:05] PROBLEM - puppet last run on phab1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:11:54] (03CR) 10Paladox: rsyslog: input::file add multiline handling & ship gerrit logs to ELK (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/475840 (https://phabricator.wikimedia.org/T141324) (owner: 10Herron) [18:12:22] (03PS2) 10Ayounsi: Revert "Netbox, set the napalm_username variable and matching keyholder" [puppet] - 10https://gerrit.wikimedia.org/r/476583 (https://phabricator.wikimedia.org/T205898) [18:12:35] PROBLEM - puppet last run on people1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:12:45] (03CR) 10Herron: [C: 032] rsyslog: input::file add multiline handling & ship gerrit logs to ELK (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/475840 (https://phabricator.wikimedia.org/T141324) (owner: 10Herron) [18:12:53] 10Operations, 10ops-eqiad: eqiad pdu audit - https://phabricator.wikimedia.org/T210760 (10RobH) [18:13:44] cscott, arlolra, subbu, halfak, and Amir1: Are you using your window today? If not, I'd like to deploy a config change. 
[18:13:53] (03PS8) 10DCausse: [cirrus] Allow configuration arrays in production services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475747 (https://phabricator.wikimedia.org/T210381) [18:13:55] (03PS8) 10DCausse: [cirrus] switch to explicit config in production services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475748 (https://phabricator.wikimedia.org/T210381) [18:13:57] (03PS8) 10DCausse: [cirrus] prepare multi-instance services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475749 (https://phabricator.wikimedia.org/T210381) [18:13:59] (03PS17) 10DCausse: [cirrus] Add temp clusters but still write to the old ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475750 (https://phabricator.wikimedia.org/T210381) [18:14:02] (03PS7) 10DCausse: [cirrus] Start writing to psi & omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381) [18:14:03] anomie, no parsoid deploy today [18:14:04] (03PS7) 10DCausse: [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) [18:14:06] (03PS9) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) [18:14:21] No deploy for ores now [18:14:31] PROBLEM - puppet last run on rutherfordium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:15:08] these recent puppet last runs are my bad, pushing a fix shortly [18:16:22] (03PS1) 10Herron: rsyslog::input::file fix startmsg_regex data type [puppet] - 10https://gerrit.wikimedia.org/r/476587 [18:16:36] 10Operations, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps - https://phabricator.wikimedia.org/T210757 (10Dzahn) i suggested to create this access request. 
per IRC chat, adding some detail what is needed / requested here: add the existing admin... [18:17:27] (03CR) 10Herron: [C: 032] rsyslog::input::file fix startmsg_regex data type [puppet] - 10https://gerrit.wikimedia.org/r/476587 (owner: 10Herron) [18:18:07] PROBLEM - puppet last run on puppetdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:19:19] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work): Broken elasticsearch-prometheus-exporter service on logstash nodes after reboot - https://phabricator.wikimedia.org/T210597 (10EBjune) [18:19:24] herron i now get: Detail: undefined method `empty?' for nil:NilClass [18:19:35] Filepath: /etc/puppet/modules/rsyslog/templates/input/file.erb [18:19:37] yep, reverting [18:19:57] herron i think replace .empty will fix it (ie just @var [18:20:35] need to correct the template and there’s another issue with puppet escaping the regex string as well [18:20:49] I’ll revert and fix outside prod then resubmit [18:21:06] (03PS1) 10Herron: Revert "rsyslog: input::file add multiline handling & ship gerrit logs to ELK" [puppet] - 10https://gerrit.wikimedia.org/r/476590 [18:21:28] (03CR) 10jerkins-bot: [V: 04-1] Revert "rsyslog: input::file add multiline handling & ship gerrit logs to ELK" [puppet] - 10https://gerrit.wikimedia.org/r/476590 (owner: 10Herron) [18:23:18] (03PS2) 10Herron: Revert "rsyslog: input::file add multiline handling & ship gerrit logs to ELK" [puppet] - 10https://gerrit.wikimedia.org/r/476590 [18:24:09] 10Operations, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10Dzahn) [18:24:39] (03CR) 10Herron: [C: 032] Revert "rsyslog: input::file add multiline handling & ship gerrit logs to ELK" [puppet] - 10https://gerrit.wikimedia.org/r/476590 (owner: 10Herron) [18:24:44] 10Operations, 10Analytics, 10Analytics-Cluster, 
10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10Dzahn) [18:25:32] (03PS1) 10Anomie: Set comment migration stage to new on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476591 (https://phabricator.wikimedia.org/T166733) [18:25:36] (03PS1) 10Herron: rsyslog: input::file add multiline handling & ship gerrit logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/476592 [18:25:52] (03CR) 10Anomie: [C: 032] "Deploying config change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476591 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [18:26:29] (03PS2) 10Herron: rsyslog: input::file add multiline handling & ship gerrit logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/476592 (https://phabricator.wikimedia.org/T141324) [18:26:40] (03CR) 10Herron: "follow up to Ic843d3b0a1a40f831e569006776c24ec7cf54033" [puppet] - 10https://gerrit.wikimedia.org/r/476592 (https://phabricator.wikimedia.org/T141324) (owner: 10Herron) [18:27:27] (03Merged) 10jenkins-bot: Set comment migration stage to new on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476591 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [18:27:47] PROBLEM - puppet last run on phab1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:28:46] (03CR) 10Paladox: rsyslog: input::file add multiline handling & ship gerrit logs to ELK (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/476592 (https://phabricator.wikimedia.org/T141324) (owner: 10Herron) [18:28:54] !log anomie@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Setting comment migration to write-new/read-new on group 0 (T166733) (duration: 00m 52s) [18:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:58] T166733: Deploy refactored comment storage - https://phabricator.wikimedia.org/T166733 [18:29:31] PROBLEM - puppet last run on puppetdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:29:46] 10Operations, 10Analytics, 10Analytics-Cluster, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10Dzahn) what it would mean: all members of "researchers": ` members: [a... [18:31:11] (03CR) 10jenkins-bot: Set comment migration stage to new on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476591 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [18:32:21] 10Operations, 10Analytics, 10Analytics-Cluster, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10Dzahn) The alternative is creating an entirely new group with a better name... [18:33:20] 10Operations, 10ops-eqiad, 10media-storage: rack/setup/install ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T209618 (10Cmjohnson) @fgiunchedi For racking this is the space I have I can do at least 3 in A with out a problem, I can only 2 in C and that would be the same rack (C2) B can ha... 
[18:33:39] RECOVERY - puppet last run on puppetdb1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:33:59] 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Dzahn) [18:34:01] 10Operations, 10Analytics, 10Analytics-Cluster, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10Dzahn) [18:34:43] RECOVERY - puppet last run on puppetdb2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:34:55] RECOVERY - puppet last run on phab2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:34:58] RECOVERY - puppet last run on phab1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:36:01] hey all.. which username / password should i be using for https://logstash-beta.wmflabs.org/ - i cant seem to get access with wikitech credentials [18:37:19] and if i dont have access could somebody give me access? [18:39:48] jdlrobson: https://www.mediawiki.org/wiki/Beta_Cluster#Testing_changes_on_Beta_Cluster [18:40:49] jynus: you are a wonderful person. Thank you! you've saved me hours <3 [18:40:53] 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Dzahn) Also talked with bmansurov about the pupetization part. There is my pending Gerrit change for the git clone part (pending repos are on Gerrit) a... 
[18:41:38] lol I just searched logstash on the wiki [18:42:11] 10Operations, 10CirrusSearch, 10Discovery-Search: Find an alternative to curl connection pooling available in HHVM - https://phabricator.wikimedia.org/T210717 (10EBernhardson) Throwing some ideas out there: * PHP requests are stateless, and trying to share something, even an open socket, is painful. * It se... [18:42:16] but your searching skills are clearly better than mine :) [18:43:39] RECOVERY - puppet last run on people1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:45:13] (03CR) 10Dzahn: [C: 031] "lgtm! https://puppet-compiler.wmflabs.org/compiler1002/13785/cobalt.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/476459 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [18:45:35] RECOVERY - puppet last run on rutherfordium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:46:55] 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Browser-Tests, and 2 others: Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode - https://phabricator.wikimedia.org/T210557 (10Jdlrobson) p:05High>03Unbreak! 
[18:47:46] (03PS3) 10Ayounsi: Revert "Netbox, set the napalm_username variable and matching keyholder" [puppet] - 10https://gerrit.wikimedia.org/r/476583 (https://phabricator.wikimedia.org/T205898) [18:48:38] (03PS3) 10Herron: rsyslog: input::file add multiline handling & ship gerrit logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/476592 (https://phabricator.wikimedia.org/T141324) [18:50:05] (03CR) 10Ayounsi: [C: 032] Revert "Netbox, set the napalm_username variable and matching keyholder" [puppet] - 10https://gerrit.wikimedia.org/r/476583 (https://phabricator.wikimedia.org/T205898) (owner: 10Ayounsi) [18:50:06] 10Operations, 10Analytics, 10Analytics-Cluster, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10Nuria) What are recommendation api dumps? If they are destined for productio... [18:50:11] (03CR) 10Ayounsi: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/13786/" [puppet] - 10https://gerrit.wikimedia.org/r/476583 (https://phabricator.wikimedia.org/T205898) (owner: 10Ayounsi) [18:50:24] (03PS4) 10Ayounsi: Revert "Netbox, set the napalm_username variable and matching keyholder" [puppet] - 10https://gerrit.wikimedia.org/r/476583 (https://phabricator.wikimedia.org/T205898) [18:51:42] greg-g - I backported the MobileFrontend fix to 1.33.0-wmf.6 [18:52:19] CI is in progress, it should be merged soon [18:53:04] (03CR) 10Ottomata: [C: 031] "Nice, is the ensure_resource( 'file' ... ) just to work around if !defined(File[...]) stuff?" 
[puppet] - 10https://gerrit.wikimedia.org/r/476098 (https://phabricator.wikimedia.org/T208622) (owner: 10Dzahn) [18:53:29] !log Netbox: remove Napalm integration [18:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:33] (03CR) 10Herron: rsyslog: input::file add multiline handling & ship gerrit logs to ELK (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/476592 (https://phabricator.wikimedia.org/T141324) (owner: 10Herron) [18:57:21] PROBLEM - Keyholder SSH agent on netmon1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [18:58:53] RECOVERY - puppet last run on phab1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181129T1900). [19:00:04] No GERRIT patches in the queue for this window AFAICS. [19:03:19] (03PS2) 10Ayounsi: Revert "Add fake ssh keys for netbox user" [labs/private] - 10https://gerrit.wikimedia.org/r/476584 (https://phabricator.wikimedia.org/T205898) [19:03:40] (03CR) 10Ayounsi: [V: 032 C: 032] Revert "Add fake ssh keys for netbox user" [labs/private] - 10https://gerrit.wikimedia.org/r/476584 (https://phabricator.wikimedia.org/T205898) (owner: 10Ayounsi) [19:04:00] 10Operations, 10Analytics, 10Analytics-Cluster, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10bmansurov) @Nuria we are using Spark, Wikidata dumps in Hadoop, and some Hiv... 
[19:06:26] 10Operations, 10Patch-For-Review: Netbox: explore NAPALM integration - https://phabricator.wikimedia.org/T205898 (10ayounsi) All test configuration for Netbox/Napalm has been removed. [19:07:00] hello again [19:07:52] (03PS6) 10Ayounsi: Icinga: add check_vcp (part 1) [puppet] - 10https://gerrit.wikimedia.org/r/458850 (https://phabricator.wikimedia.org/T201097) [19:09:16] (03CR) 10Ayounsi: [C: 032] Icinga: add check_vcp (part 1) [puppet] - 10https://gerrit.wikimedia.org/r/458850 (https://phabricator.wikimedia.org/T201097) (owner: 10Ayounsi) [19:11:43] PROBLEM - Keyholder SSH agent on netmon2001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [19:12:21] (03CR) 10EBernhardson: [C: 031] [cirrus] prepare multi-instance services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475749 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [19:13:23] raynor: hello :) I will take care of deploying your MobileFrontend patch for 1.33.0-wmf.6 :) [19:13:29] raynor: thanks for the quick fix and backport! [19:13:56] np, sorry for merging broken code earlier. I should pay a bit more attention to return types [19:16:19] !log hashar@deploy1001 Synchronized php-1.33.0-wmf.6/extensions/MobileFrontend/: RecordRevision::getUser() returns UserIdentity not int - T210737 (duration: 00m 55s) [19:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:22] T210737: Various User.php: PHP Notice: Object of class User could not be converted to int - https://phabricator.wikimedia.org/T210737 [19:16:50] raynor: it happens. I will roll the train to all wikis once I am done with all the hotfixes [19:17:30] (03CR) 10Dzahn: "thanks! yea, it's to avoid any duplicate definitions when you have multiple profiles ensuring /srv/research/ is a directory." 
[puppet] - 10https://gerrit.wikimedia.org/r/476098 (https://phabricator.wikimedia.org/T208622) (owner: 10Dzahn) [19:20:15] (03CR) 10Dzahn: "now this is just waiting on the requested repos being created on gerrit and content moved from github" [puppet] - 10https://gerrit.wikimedia.org/r/476098 (https://phabricator.wikimedia.org/T208622) (owner: 10Dzahn) [19:21:46] (03PS1) 10Ottomata: Use refinery-job 0.0.81 for refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/476601 (https://phabricator.wikimedia.org/T210465) [19:23:07] (03CR) 10Dzahn: [C: 031] "wow, you already merged it. thanks :))" [puppet] - 10https://gerrit.wikimedia.org/r/416742 (owner: 10Dzahn) [19:23:18] (03PS1) 10Muehlenhoff: Update MOU dates for pirroh and piccardi [puppet] - 10https://gerrit.wikimedia.org/r/476602 [19:24:42] (03CR) 10EBernhardson: [C: 031] [cirrus] prepare multi-instance services (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475749 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [19:26:33] (03CR) 10Muehlenhoff: [C: 032] Update MOU dates for pirroh and piccardi [puppet] - 10https://gerrit.wikimedia.org/r/476602 (owner: 10Muehlenhoff) [19:29:17] Hi all [19:29:44] Is it just me or are other people reporting suspect login attempts at the moment? [19:30:48] (03PS1) 10Ayounsi: Icinga, assign check_vcp to all VC switches [puppet] - 10https://gerrit.wikimedia.org/r/476604 (https://phabricator.wikimedia.org/T201097) [19:33:34] Hello ShakespeareFan00! 
[19:33:57] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/13787/icinga1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/476604 (https://phabricator.wikimedia.org/T201097) (owner: 10Ayounsi) [19:33:58] Don't know; haven't gotten one [19:36:30] (03CR) 10EBernhardson: [cirrus] Add temp clusters but still write to the old ones (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475750 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [19:41:06] (03PS1) 10Muehlenhoff: Absent NfsdCollector Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/476606 (https://phabricator.wikimedia.org/T183454) [19:42:03] !log remove neodymium/sarin from mgmt routers - T210612 [19:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:07] T210612: Remove neodymium/sarin from router ACLs - https://phabricator.wikimedia.org/T210612 [19:43:57] (03CR) 10EBernhardson: [cirrus] Add temp clusters but still write to the old ones (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475750 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [19:45:03] (03CR) 10GTirloni: [C: 032] Absent NfsdCollector Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/476606 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [19:45:14] (03CR) 10EBernhardson: [C: 031] [cirrus] Start writing to psi & omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [19:46:43] 10Operations, 10Discovery-Search (Current work), 10Epic, 10Patch-For-Review: Migrate elasticsearch scripts to spicerack cookbooks - https://phabricator.wikimedia.org/T202885 (10debt) [19:46:47] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work), 10Patch-For-Review: Refactor current code base to support multiple elasticsearch instances/multiple elasticsearch clusters - https://phabricator.wikimedia.org/T207918 
(10debt) 05Open>03Resolved [19:49:07] 10Operations, 10netops: Remove neodymium/sarin from router ACLs - https://phabricator.wikimedia.org/T210612 (10ayounsi) 05Open>03Resolved a:03ayounsi Removed! [19:50:13] (03CR) 10EBernhardson: [C: 031] [cirrus] Start using replica group settings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [19:55:00] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (10GTirloni) [19:55:02] (03CR) 10Dzahn: [C: 031] Icinga, assign check_vcp to all VC switches [puppet] - 10https://gerrit.wikimedia.org/r/476604 (https://phabricator.wikimedia.org/T201097) (owner: 10Ayounsi) [19:56:53] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Refactor puppet WDQS module - https://phabricator.wikimedia.org/T208201 (10debt) 05Open>03Resolved [19:58:43] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Refactor puppet WDQS module - https://phabricator.wikimedia.org/T208201 (10debt) [19:58:52] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: Refactor wdqs::gui - Separate cron tasks from the module - https://phabricator.wikimedia.org/T209257 (10debt) 05Open>03Resolved [19:59:02] (03CR) 10Ayounsi: [C: 032] Icinga, assign check_vcp to all VC switches [puppet] - 10https://gerrit.wikimedia.org/r/476604 (https://phabricator.wikimedia.org/T201097) (owner: 10Ayounsi) [19:59:09] (03PS2) 10Ayounsi: Icinga, assign check_vcp to all VC switches [puppet] - 10https://gerrit.wikimedia.org/r/476604 (https://phabricator.wikimedia.org/T201097) [20:00:04] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181129T2000) [20:01:10] 10Operations, 10Beta-Cluster-Infrastructure: "Obama" page on 
Beta Cluster often responds with 503 - https://phabricator.wikimedia.org/T188913 (10Jdlrobson) [20:01:22] !log Apply Icinga:check_vcp to all VC switches - T201097 [20:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:26] T201097: Add virtual chassis port status alerting - https://phabricator.wikimedia.org/T201097 [20:01:55] (03PS2) 10Ottomata: Use refinery-job 0.0.81 for refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/476601 (https://phabricator.wikimedia.org/T210465) [20:02:08] (03CR) 10Ottomata: [C: 032] "tested and works fine" [puppet] - 10https://gerrit.wikimedia.org/r/476601 (https://phabricator.wikimedia.org/T210465) (owner: 10Ottomata) [20:02:11] (03CR) 10Ottomata: [V: 032 C: 032] Use refinery-job 0.0.81 for refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/476601 (https://phabricator.wikimedia.org/T210465) (owner: 10Ottomata) [20:07:03] PROBLEM - Juniper virtual chassis ports on asw2-c-eqiad is CRITICAL: CRIT: Down: 1 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [20:07:40] rgr! 
[20:16:14] robh, papaul, cmjohnson1, see above, I added an Icinga check for Virtual Chassis ports, with the following runbook: https://wikitech.wikimedia.org/wiki/Network_monitoring#VCP_status [20:16:40] 10Operations, 10ops-codfw: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (10Papaul) [20:16:45] Basically offloading it all to DCops :) let me know if you have questions [20:16:52] thanks [20:17:47] cmjohnson1: and asw2-c-eqiad is complaining about a down port, if you want to be the 1st one to test that runbook :) [20:18:02] :) cool stuff [20:28:07] (03PS11) 10Dzahn: create profile::research::article_recommender [puppet] - 10https://gerrit.wikimedia.org/r/476098 (https://phabricator.wikimedia.org/T208622) [20:28:47] (03PS1) 10Ottomata: Blacklist mediawiki_revision_score from refine again until we fix problem [puppet] - 10https://gerrit.wikimedia.org/r/476615 (https://phabricator.wikimedia.org/T210465) [20:29:55] (03CR) 10Dzahn: [C: 04-1] "adjusted repo / dir name to match: And this one is pending: https://gerrit.wikimedia.org/r/#/admin/projects/research/article-r" [puppet] - 10https://gerrit.wikimedia.org/r/476098 (https://phabricator.wikimedia.org/T208622) (owner: 10Dzahn) [20:30:00] (03CR) 10Ottomata: [C: 032] Blacklist mediawiki_revision_score from refine again until we fix problem [puppet] - 10https://gerrit.wikimedia.org/r/476615 (https://phabricator.wikimedia.org/T210465) (owner: 10Ottomata) [20:31:20] (03PS3) 10Dzahn: cache/trafficserver: replace rutherfordium with people1001, backend and director [puppet] - 10https://gerrit.wikimedia.org/r/475236 (https://phabricator.wikimedia.org/T210036) [20:33:40] (03PS4) 10Dzahn: cache/trafficserver: replace rutherfordium with people1001, backend and director [puppet] - 10https://gerrit.wikimedia.org/r/475236 (https://phabricator.wikimedia.org/T210036) [20:33:57] (03PS5) 10Dzahn: cache/trafficserver: replace rutherfordium with people1001, backend and 
director [puppet] - 10https://gerrit.wikimedia.org/r/475236 (https://phabricator.wikimedia.org/T210036) [20:34:55] 10Operations, 10ops-codfw: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (10Papaul) [20:36:11] (03CR) 10EBernhardson: [cirrus] Cleanup transitional states (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [20:36:59] (03CR) 10BryanDavis: [C: 031] deployment-prep: Try changing redis_lock entries to memc hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475025 (https://phabricator.wikimedia.org/T210030) (owner: 10Alex Monk) [20:38:11] (03CR) 10Dzahn: [C: 032] cache/trafficserver: replace rutherfordium with people1001, backend and director [puppet] - 10https://gerrit.wikimedia.org/r/475236 (https://phabricator.wikimedia.org/T210036) (owner: 10Dzahn) [20:42:42] !log people.wikimedia.org is switching backends from rutherfordium to people1001, please stand by during a short maintenance period.. 
data has been copied | https://wikitech.wikimedia.org/wiki/People.wikimedia.org#Backend_upgrade_November_2018 | T210036 [20:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:46] T210036: upgrade people.wm.org (rutherfordium) to stretch - https://phabricator.wikimedia.org/T210036 [20:44:40] mutante: people1001 is the best server name yet [20:45:09] (03CR) 10Dzahn: [C: 032] "watched puppet run and service refresh on cp1079, saw no issue" [puppet] - 10https://gerrit.wikimedia.org/r/475236 (https://phabricator.wikimedia.org/T210036) (owner: 10Dzahn) [20:45:35] chasemp: hehehe, i made sure to add it to the official naming standard page [20:46:14] folks can just use "people.eqiad" without remembering a number [20:46:24] for the future [20:46:26] 👍 [20:46:50] no people.codfw yet :p [20:47:56] (03PS2) 10Dzahn: switch people.eqiad from rutherfordium to people1001 [dns] - 10https://gerrit.wikimedia.org/r/475234 [20:48:23] (03CR) 10Dzahn: [C: 032] switch people.eqiad from rutherfordium to people1001 [dns] - 10https://gerrit.wikimedia.org/r/475234 (owner: 10Dzahn) [20:50:02] !log people - rsynced /home one last time, switched DNS people.eqiad CNAME over, varnish change merged (T210036) [20:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:06] T210036: upgrade people.wm.org (rutherfordium) to stretch - https://phabricator.wikimedia.org/T210036 [20:54:04] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (10Ottomata) @Cmjohnson I don't know what the proper disk layout for these are, since they will be Cloud Virt nodes. I doubt RAID 0 is... 
[20:54:54] (03PS1) 10Dzahn: remove peopleweb role from rutherfordium [puppet] - 10https://gerrit.wikimedia.org/r/476618 (https://phabricator.wikimedia.org/T210036) [20:56:22] (03CR) 10Dzahn: [C: 032] "this removes shell access for all non-roots, the easiest way to prevent people still going to the old server" [puppet] - 10https://gerrit.wikimedia.org/r/476618 (https://phabricator.wikimedia.org/T210036) (owner: 10Dzahn) [20:57:53] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team, 10User-Smalyshev: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (10Andrew) [20:57:57] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team: delete t206636-3 VM and revert quota bumps for project wikidata-query - https://phabricator.wikimedia.org/T207101 (10Andrew) 05Open>03Resolved [20:59:55] (03PS2) 10Dzahn: Revert "peopleweb: allow rsync of /home from rutherfordium to people1001" [puppet] - 10https://gerrit.wikimedia.org/r/475249 [21:00:51] (03CR) 10Dzahn: [C: 032] Revert "peopleweb: allow rsync of /home from rutherfordium to people1001" [puppet] - 10https://gerrit.wikimedia.org/r/475249 (owner: 10Dzahn) [21:01:51] 10Operations, 10Analytics, 10Security-Team, 10WMF-Legal, 10Software-Licensing: Can exfat be used in WMF production? - https://phabricator.wikimedia.org/T210667 (10Jrogers-WMF) Hi all, commenting on this from WMF Legal. As I understand the question and context, the issue is using a proprietary format fo... [21:03:18] (03CR) 10Dzahn: [C: 032] "i did not "absent" it but there was no point when rutherfordium had the rsyncd config and will be deleted entirely" [puppet] - 10https://gerrit.wikimedia.org/r/475249 (owner: 10Dzahn) [21:06:50] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (10Andrew) >>! 
In T207194#4787008, @Ottomata wrote: > @Cmjohnson I don't know what the proper disk layout for these are, since they wil... [21:06:53] (03PS1) 10MusikAnimal: Use log channel 'AbuseFilter' instead of 'AbuseFilterSlow' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476620 (https://phabricator.wikimedia.org/T210636) [21:07:11] (03PS2) 10Dzahn: remove rutherfordium from site, netboot, DHCP [puppet] - 10https://gerrit.wikimedia.org/r/475237 (https://phabricator.wikimedia.org/T210036) [21:07:35] Can anybody create a wiki account to complete T204477? [21:07:36] T204477: Create punjabi.wikimedia.org for Punjabi Wikimedians User Group - https://phabricator.wikimedia.org/T204477 [21:07:59] ping Reedy, thcipriani, no_justification ^^ [21:08:43] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (10Ottomata) Hmm, ok, then I think in this case RAID 0 is fine. Since these will have Hadoop, data will be replicated across nodes 3x... [21:10:00] (03CR) 10Krinkle: [C: 032] Use log channel 'AbuseFilter' instead of 'AbuseFilterSlow' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476620 (https://phabricator.wikimedia.org/T210636) (owner: 10MusikAnimal) [21:10:29] * Krinkle performs ritual to acquire lock on mwdebug1002 [21:11:05] (03Merged) 10jenkins-bot: Use log channel 'AbuseFilter' instead of 'AbuseFilterSlow' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476620 (https://phabricator.wikimedia.org/T210636) (owner: 10MusikAnimal) [21:11:20] musikanimal: staging now on mwdebug1002 [21:11:45] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (10Andrew) As I understand it, with raid 0 if a single drive dies the whole system (and containing VM) will have to be rebuilt. Also I... [21:11:46] thanks. 
Testing in progress [21:12:04] k, sync is done. [21:13:22] musikanimal: so, one random little thing about logstash, I'd recommend editing the first filter bubble on that link and [x]-ing 1001 to clear the log of unrelated messages [21:14:05] mkay [21:15:36] hmm the X-Wikimedia-Debug Chrome extension isn't working [21:16:11] I see tiny flyout when I click the button, not the full thing where I can select mwdebug1002, etc. [21:17:37] hm.. sometimes it takes a few tries to close/re-open. There's a race condition. [21:19:33] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [21:20:27] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [21:21:08] 10Operations, 10ops-eqiad, 10netops: faulty VC link on asw2-c-eqiad - https://phabricator.wikimedia.org/T210788 (10ayounsi) p:05Triage>03High [21:22:02] (03CR) 10jenkins-bot: Use log channel 'AbuseFilter' instead of 'AbuseFilterSlow' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476620 (https://phabricator.wikimedia.org/T210636) (owner: 10MusikAnimal) [21:22:24] got it to work, had to restart my browser [21:22:38] but now I'm having trouble making the filter slow enough! [21:23:31] (03CR) 10Vgutierrez: [C: 032] gerrit: Switch between old LE puppetization and certcentral using hiera [puppet] - 10https://gerrit.wikimedia.org/r/476459 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [21:23:39] (03PS5) 10Vgutierrez: gerrit: Switch between old LE puppetization and certcentral using hiera [puppet] - 10https://gerrit.wikimedia.org/r/476459 (https://phabricator.wikimedia.org/T207050) [21:26:20] I think there's some db caching or something going on. 
I've got a really awfully written filter here and it's still going really fast [21:27:38] Krinkle: I think we're just going to have to hope for the best [21:28:34] I could make the AbuseFilter target all editors, or a wider range, but I don't want to slow down their editing just to test this [21:29:54] I do see the AbuseFilter cache hits/misses in logstash, so I know I'm looking at the right thing, and that X-Wikimedia-Debug is working, etc. [21:34:32] !log removed unused vc-port on asw2-c-eqiad:fpc8 - T210788 [21:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:36] T210788: faulty VC link on asw2-c-eqiad - https://phabricator.wikimedia.org/T210788 [21:35:03] RECOVERY - Juniper virtual chassis ports on asw2-c-eqiad is OK: OK: UP: 22 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [21:36:06] (03CR) 10Dzahn: [C: 032] remove rutherfordium from site, netboot, DHCP [puppet] - 10https://gerrit.wikimedia.org/r/475237 (https://phabricator.wikimedia.org/T210036) (owner: 10Dzahn) [21:36:14] 10Operations, 10ops-eqiad, 10netops: faulty VC link on asw2-c-eqiad - https://phabricator.wikimedia.org/T210788 (10ayounsi) 05Open>03Resolved a:05Cmjohnson>03ayounsi That was actually an unused port. [21:36:17] (03PS3) 10Dzahn: remove rutherfordium from site, netboot, DHCP [puppet] - 10https://gerrit.wikimedia.org/r/475237 (https://phabricator.wikimedia.org/T210036) [21:41:30] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, and 2 others: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10daniel) We discussed this again in the TechCom meeting the other day. If DBAs are ok with not just the new field and indexes, bu... [21:41:59] !log changing email for User:Mathounette [21:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:03] musikanimal: I'd say it's fine to enable on testwiki for more editors, no problem. 
[21:42:15] sorry for the delay, was distracted :) [21:42:40] (03PS2) 10Dzahn: remove rutherfordium.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/475235 (https://phabricator.wikimedia.org/T210036) [21:45:13] no problem. I've created https://test.wikipedia.org/wiki/Special:AbuseFilter/189 [21:45:33] that would normally be really, really slow [21:47:06] 10Operations, 10ops-codfw: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (10Papaul) [21:49:52] Krinkle: err wait, slow filter runtimes won't be reported for other editors, since they're not on mwdebug1002, right? [21:49:53] musikanimal: So.. why would it be slow? [21:50:06] musikanimal: That is indeed also true. [21:50:50] I tried editing [[Barack Obama]] on testwiki, large article. No "slow filter" entry in logstash [21:51:10] I think there are a lot of conditions that need to be met for a filter to have a slow run time [21:51:20] in production, across all wikis, there were only 50 or so a day [21:52:51] hard to say if it's actually working or not [21:52:56] OK. I'll roll it out then [21:53:01] It's only a log channel anyway. [21:53:12] We're not actually potentially making anything slow. [21:53:13] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:53:14] yeah, and it wasn't working before, so it can't be any worse [21:53:31] :) [21:53:39] thanks! [21:53:40] all 1.33.0-wmf.6 blockers have been fixed or ruled out. So we can proceed with the last group [21:53:52] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, and 2 others: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) @Marostegui Hello! I've added a few summary columns and indexes to the link tables, and the resulting DDL would look li... 
[21:54:22] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T210636 - I9ebbc625f98c314 (duration: 00m 55s) [21:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:25] T210636: Slow filters logstash dashboard no longer being updated - https://phabricator.wikimedia.org/T210636 [21:54:36] * Krinkle performs ritual to release lock on mwdebug1002 [21:55:13] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 1.570 second response time [21:55:47] Krinkle: let me know when you are done, i will resume the train next :) [21:56:29] * Krinkle is done [21:59:47] good [22:00:42] (03PS1) 10Hashar: all wikis to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476764 [22:00:44] (03CR) 10Hashar: [C: 032] all wikis to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476764 (owner: 10Hashar) [22:01:52] (03Merged) 10jenkins-bot: all wikis to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476764 (owner: 10Hashar) [22:01:53] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:03:20] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.33.0-wmf.6 [22:04:24] hashar@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
[22:04:44] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.33.0-wmf.6 [22:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:51] who knows [22:05:09] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.096 second response time [22:06:56] bah [22:07:11] bunch of memcached error keys due to the servers on port 11213 having a timeout [22:07:30] I am afraid a new version ends up causing a large denial of service on our memcached relay :// [22:08:17] (03PS3) 10Andrew Bogott: deployment-prep: move lists of cache nodes out of labs.yaml hiera [puppet] - 10https://gerrit.wikimedia.org/r/475225 (owner: 10Alex Monk) [22:08:19] (03PS4) 10Andrew Bogott: deployment-prep: Clean up from cache-text04 -> cache-text05 migration [puppet] - 10https://gerrit.wikimedia.org/r/475227 (owner: 10Alex Monk) [22:09:00] (03CR) 10Andrew Bogott: [C: 032] deployment-prep: move lists of cache nodes out of labs.yaml hiera [puppet] - 10https://gerrit.wikimedia.org/r/475225 (owner: 10Alex Monk) [22:09:39] (03CR) 10Andrew Bogott: [C: 032] deployment-prep: Clean up from cache-text04 -> cache-text05 migration [puppet] - 10https://gerrit.wikimedia.org/r/475227 (owner: 10Alex Monk) [22:10:59] (03CR) 10jenkins-bot: all wikis to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476764 (owner: 10Hashar) [22:11:19] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:14:54] 10Operations, 10ops-codfw: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10colewhite) Saw a crash happen today Thu Nov 29 at 22:10Z [22:21:23] !log 1.33.0-wmf.6 is on all wikis and looks stable. 
[22:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:29] train is complete [22:29:10] 10Operations, 10ops-codfw: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (10Papaul) [22:32:47] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 7.121 second response time [22:51:37] (03CR) 10Paladox: "Breaks in the cloud with:" [puppet] - 10https://gerrit.wikimedia.org/r/476459 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [22:54:55] (03CR) 10Cwhite: [C: 032] initial commit [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/471298 (https://phabricator.wikimedia.org/T208066) (owner: 10Cwhite) [22:57:53] (03PS1) 10EBernhardson: Update wbsearchentities ab test configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476772 (https://phabricator.wikimedia.org/T209402) [22:59:17] 10Operations, 10Citoid, 10Services (watching), 10VisualEditor (Current work): Decreased internationalisation of automatic citations as a result of switch to new translation-server - https://phabricator.wikimedia.org/T210806 (10Mvolz) 05Open>03stalled p:05Triage>03Normal [23:01:35] 10Operations, 10monitoring, 10User-CDanis: graph server temperature metrics - https://phabricator.wikimedia.org/T209863 (10CDanis) Things I have learned today: Using the labels provided by node_hwmon_sensor_labels is not that hard... However, if you write this: ` node_hwmon_temp_celsius{instance=~"$serve... [23:03:00] (03CR) 10Paladox: "Oh nvm, it broke because of https://gerrit.wikimedia.org/r/c/operations/puppet/+/475225" [puppet] - 10https://gerrit.wikimedia.org/r/476459 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [23:03:15] (03CR) 10Dzahn: ""it's better to break any such projects loudly rather than silently." 
yea, i think it worked and paladox noticed and they do exist" [puppet] - 10https://gerrit.wikimedia.org/r/475225 (owner: 10Alex Monk) [23:05:38] (03CR) 10Alex Monk: "Interesting. What was he using that had the deployment-prep cache hiera stuff?" [puppet] - 10https://gerrit.wikimedia.org/r/475225 (owner: 10Alex Monk) [23:07:03] paladox: ^ please add :) [23:07:15] (03CR) 10Paladox: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/475225 (owner: 10Alex Monk) [23:07:26] done [23:08:05] (03CR) 10Alex Monk: "Interesting. I suppose in that case you probably want to trust the novaproxy hosts instead of deployment-cache-* ?" [puppet] - 10https://gerrit.wikimedia.org/r/475225 (owner: 10Alex Monk) [23:12:40] (03PS3) 10Bstorm: sonofgridengine: set up shadow_master profile [puppet] - 10https://gerrit.wikimedia.org/r/476430 (https://phabricator.wikimedia.org/T200557) [23:26:46] !log puppetmaster: sudo puppet cert revoke rutherfordium.eqiad.wmnet; sudo puppet node clean rutherfordium.eqiad.wmnet ; sudo puppet node deactivate rutherfordium.eqiad.wmnet ; run puppet on icinga1001.. removed host from monitoring (decom for ganeti VM) (T210036) [23:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:50] anomie: maybe you can create gerrit repos? [23:26:51] T210036: upgrade people.wm.org (rutherfordium) to stretch - https://phabricator.wikimedia.org/T210036 [23:31:48] 10Operations, 10Analytics, 10Analytics-Cluster, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10Nuria) @bmansurov We do not recommend to generate these in stats boxes, stat... 
[23:32:35] 10Operations, 10vm-requests: upgrade people.wm.org (rutherfordium) to stretch - https://phabricator.wikimedia.org/T210036 (10Dzahn) [23:33:30] 10Operations, 10vm-requests: upgrade people.wm.org (rutherfordium) to stretch - https://phabricator.wikimedia.org/T210036 (10Dzahn) [23:34:45] 10Operations, 10vm-requests: upgrade people.wm.org (rutherfordium) to stretch - https://phabricator.wikimedia.org/T210036 (10Dzahn) [23:35:39] 10Operations, 10Analytics, 10Analytics-Cluster, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10Nuria) @Dzahn let's please hold on on any changes, stats boxes are mean for... [23:39:10] 10Operations, 10Analytics, 10Analytics-Cluster, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10Dzahn) @Nuria understood! thank you for your prompt comments and don't worry... [23:39:56] 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Dzahn) [23:40:01] 10Operations, 10Analytics, 10Analytics-Cluster, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10Dzahn) 05Open>03stalled [23:44:38] 10Operations, 10Analytics, 10Analytics-Cluster, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10bmansurov) @Nuria OK, that makes sense. I'll work with #analytics on this. @... 
[23:45:54] 10Operations, 10Analytics, 10Analytics-Cluster, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10Dzahn) @Nuria I also have this pending gerrit change that i will put on hold... [23:48:31] 10Operations, 10Analytics, 10Analytics-Cluster, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10Nuria) @Dzahn I see, Let's abandon that change. Stats machines are used by... [23:50:04] (03Abandoned) 10Dzahn: create profile::research::article_recommender [puppet] - 10https://gerrit.wikimedia.org/r/476098 (https://phabricator.wikimedia.org/T208622) (owner: 10Dzahn) [23:51:04] 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Dzahn) [23:51:06] 10Operations, 10Analytics, 10Analytics-Cluster, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10Dzahn) 05stalled>03Invalid Ok Nuria! makes sense. I abandoned the change... [23:54:29] 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Dzahn) see T210757#4786370 for the latest status. things have changed since Nuria pointed out hadoop should be used instead. [23:58:13] 10Operations, 10Analytics, 10Analytics-Cluster, 10SRE-Access-Requests: Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private) - https://phabricator.wikimedia.org/T210757 (10Nuria) Many thanks to everyone for the prompt responses.