[00:05:17] PROBLEM - puppet last run on elastic1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:12:12] (03PS1) 10Jforrester: New wikitext editor: Enable the Beta Feature in Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311877 [00:30:20] RECOVERY - puppet last run on elastic1028 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [00:57:39] PROBLEM - puppet last run on krypton is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:12:01] (03PS29) 1020after4: Scap swat command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) [01:13:37] (03PS30) 1020after4: Scap swat command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) [01:13:56] (03PS31) 1020after4: Scap swat command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) [01:14:29] (03CR) 1020after4: Scap swat command (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) (owner: 1020after4) [01:22:38] RECOVERY - puppet last run on krypton is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:45:42] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1002 / webrequest logs for MelodyKramer - https://phabricator.wikimedia.org/T145387#2654729 (10Dzahn) [01:47:55] (03PS5) 10Dzahn: admin: access to stats1002/webrequest logs for Melody Kramer [puppet] - 10https://gerrit.wikimedia.org/r/311400 (https://phabricator.wikimedia.org/T145387) (owner: 10ArielGlenn) [01:49:47] PROBLEM - Postgres Replication Lag on maps-test2004 is CRITICAL: CRITICAL - Rep Delay is: 1802.846216 Seconds [01:50:04] (03CR) 10Dzahn: "amended and renamed so it's just stat1002/webrequest logs access via the 
statistics-privatedata-users group as stated on ticket" [puppet] - 10https://gerrit.wikimedia.org/r/311400 (https://phabricator.wikimedia.org/T145387) (owner: 10ArielGlenn) [01:50:46] (03PS6) 10Dzahn: admin: access to stats1002/webrequest logs for Melody Kramer [puppet] - 10https://gerrit.wikimedia.org/r/311400 (https://phabricator.wikimedia.org/T145387) (owner: 10ArielGlenn) [01:50:52] (03CR) 10Dzahn: [C: 032] admin: access to stats1002/webrequest logs for Melody Kramer [puppet] - 10https://gerrit.wikimedia.org/r/311400 (https://phabricator.wikimedia.org/T145387) (owner: 10ArielGlenn) [01:51:52] !log thumbor servers ran out of disk space [01:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:52:18] RECOVERY - Postgres Replication Lag on maps-test2004 is OK: OK - Rep Delay is: 123.584916 Seconds [02:11:21] !log thumbor1001/1002 - moved logs from /var/log/thumbor to /srv/thumborlogs to free some space, the actual issue is in /tmp though. lots of systemd-private-* dirs with large sizes. like https://bugzilla.redhat.com/show_bug.cgi?id=1183684 ? [02:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:11:49] PROBLEM - puppet last run on restbase1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:15:31] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2495710 (10Dzahn) thumbor1001/1002 ran out of disk space. 100% full. was alerted via Icinga 19:11 < mutante> !log thumbor1001/1002 - moved logs from /var... 
[02:16:39] PROBLEM - Disk space on thumbor1002 is CRITICAL: DISK CRITICAL - free space: / 1687 MB (3% inode=77%) [02:17:28] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2654782 (10Dzahn) 35G tmp while /dev/md0 is just 46G same situation on both servers [02:24:18] RECOVERY - puppet last run on restbase1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:27:20] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2654800 (10Dzahn) We have LVM but no free extents to resize / they are all used for /srv already. And all of these directories in /tmp actually belong to o... [02:29:50] RECOVERY - Disk space on thumbor1001 is OK: DISK OK [02:30:51] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2654805 (10Dzahn) on thumbor1001, more disk space got freed by itself a couple minutes later.. then: 19:30 < icinga-wm> RECOVERY - Disk space on thumbor1001... [02:34:57] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1002 / webrequest logs for MelodyKramer - https://phabricator.wikimedia.org/T145387#2654806 (10Dzahn) Hi @MelodyKramer your user has been created on stat1002 (I added you to the statistics-privatedata-users group which gives ac... 
[02:35:22] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1002 / webrequest logs for MelodyKramer - https://phabricator.wikimedia.org/T145387#2654808 (10Dzahn) 05Open>03Resolved a:03Dzahn [02:38:30] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002 / webrequest logs for MelodyKramer - https://phabricator.wikimedia.org/T145387#2654812 (10Dzahn) [02:39:12] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.18) (duration: 16m 28s) [02:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:39:38] PROBLEM - puppet last run on wtp2013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:41:38] PROBLEM - Disk space on thumbor1002 is CRITICAL: DISK CRITICAL - free space: / 1652 MB (3% inode=76%) [02:45:26] !log thumbor1002 moved nginx access logs to /srv for more space on / [02:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:46:23] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Sep 21 02:46:22 UTC 2016 (duration 7m 11s) [02:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:56:30] RECOVERY - Disk space on thumbor1002 is OK: DISK OK [03:04:40] RECOVERY - puppet last run on wtp2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [03:10:17] (03Abandoned) 10BBlack: LRU_Fail debugging [debs/varnish4] - 10https://gerrit.wikimedia.org/r/310540 (owner: 10BBlack) [03:12:36] (03CR) 10BBlack: [C: 031] varnish/htcppurger: don't use ensure => latest [puppet] - 10https://gerrit.wikimedia.org/r/310895 (https://phabricator.wikimedia.org/T115348) (owner: 10Dzahn) [03:13:49] PROBLEM - Varnishkafka log producer on cp1064 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [03:16:37] (03CR) 10BBlack: [C: 031] Improve resilience during varnish (re)starts [software/varnish/varnishkafka] - 
10https://gerrit.wikimedia.org/r/311415 (https://phabricator.wikimedia.org/T138747) (owner: 10Elukey) [03:16:50] PROBLEM - puppet last run on elastic2019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:17:01] (03CR) 10BBlack: [C: 031] varnish: add varnish-fe restart script [puppet] - 10https://gerrit.wikimedia.org/r/311387 (owner: 10Ema) [03:18:45] (03CR) 10BBlack: [C: 031] check_ssl: Use a maximum percentage of certificate validity time for determining alert state [puppet] - 10https://gerrit.wikimedia.org/r/309203 (https://phabricator.wikimedia.org/T144293) (owner: 10Alex Monk) [03:28:39] RECOVERY - Varnishkafka log producer on cp1064 is OK: PROCS OK: 1 process with command name varnishkafka [03:41:44] RECOVERY - puppet last run on elastic2019 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [03:54:35] !log krinkle@tin Synchronized php-1.28.0-wmf.19/resources/src/mediawiki/mediawiki.requestIdleCallback.js: I221cd6c2b (duration: 00m 47s) [03:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:56:39] !log krinkle@tin Synchronized php-1.28.0-wmf.18/resources/src/mediawiki/mediawiki.requestIdleCallback.js: I221cd6c2b (duration: 00m 48s) [03:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:58:05] !log krinkle@tin Synchronized php-1.28.0-wmf.18/resources/src/mediawiki/mediawiki.js: I221cd6c2b (duration: 00m 46s) [03:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:22:41] PROBLEM - puppet last run on db1016 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [04:47:49] RECOVERY - puppet last run on db1016 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:12:00] (03CR) 10Giuseppe Lavagetto: [C: 032] Release 0.3.1 [software/conftool] - 10https://gerrit.wikimedia.org/r/311729 (owner: 10Giuseppe Lavagetto) [06:12:31] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:21:06] !log removing aqs100[123] from live traffic - aqs.svc.eqiad.wmnet - T144497 [06:21:07] T144497: Switch AQS to new cluster - https://phabricator.wikimedia.org/T144497 [06:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:31:19] (03PS2) 10Giuseppe Lavagetto: base::puppet: add ca_server setting when needed [puppet] - 10https://gerrit.wikimedia.org/r/310497 [06:35:28] (03CR) 10Giuseppe Lavagetto: [C: 032] base::puppet: add ca_server setting when needed [puppet] - 10https://gerrit.wikimedia.org/r/310497 (owner: 10Giuseppe Lavagetto) [06:36:15] 06Operations, 10DBA: Drop PovWatch extension-related database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54924#2655144 (10Marostegui) There is a backup for these tables at: ``` root@dbstore1001:/srv/tmp/povwatch_tables# ls -lh total 12K -rw-r--r-- 1 root root 860 Sep 13 13:47 s1_povwatc... [06:37:52] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [06:39:52] (03PS4) 10Giuseppe Lavagetto: scap: introduce scap_source type [puppet] - 10https://gerrit.wikimedia.org/r/308973 [06:43:24] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me. I'll test with two parallel reimaging runs later on." 
[puppet] - 10https://gerrit.wikimedia.org/r/311701 (https://phabricator.wikimedia.org/T143536) (owner: 10Volans) [06:53:12] (03CR) 10Elukey: [C: 031] Reimage: minor improvements [puppet] - 10https://gerrit.wikimedia.org/r/311701 (https://phabricator.wikimedia.org/T143536) (owner: 10Volans) [06:59:11] !log dropping tables in S1,S3,S4 - T54924 [06:59:13] T54924: Drop PovWatch extension-related database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54924 [06:59:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:09:17] PROBLEM - puppet last run on ms-fe2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:15:43] 06Operations, 10DBA: Drop PovWatch extension-related database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54924#2655161 (10Marostegui) The renamed tables have been removed from all the hosts across S1, S3 and S4 [07:15:52] 06Operations, 10DBA: Drop PovWatch extension-related database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54924#2655162 (10Marostegui) 05Open>03Resolved [07:18:16] (03PS5) 10Giuseppe Lavagetto: scap: introduce scap_source type [puppet] - 10https://gerrit.wikimedia.org/r/308973 [07:18:39] !log reimaging mw1170-mw1172 to jessie [07:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:19:29] !log Moved some hhvm logs (/var/log/hhvm) from root:adm to www-data:www-data on mw127[678] to remove cronspam (T132324) [07:19:30] T132324: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324 [07:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:21:50] 06Operations, 10DBA: dbstore2001: crashed - https://phabricator.wikimedia.org/T146259#2655173 (10Marostegui) [07:22:21] 06Operations, 10DBA: dbstore2001: crashed - https://phabricator.wikimedia.org/T146259#2655185 (10Marostegui) [07:23:31] 
06Operations, 10DBA: dbstore2001: crashed - https://phabricator.wikimedia.org/T146259#2655192 (10jcrespo) [07:32:52] (03PS6) 10Giuseppe Lavagetto: scap: introduce scap_source type [puppet] - 10https://gerrit.wikimedia.org/r/308973 [07:33:26] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:33:35] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2655211 (10Gilles) [07:33:38] 06Operations, 06Performance-Team, 10Thumbor: Make the 100MB+ test files downloaded from their source instead of being in the git repo - https://phabricator.wikimedia.org/T145785#2655210 (10Gilles) 05Open>03Resolved [07:35:38] RECOVERY - puppet last run on ms-fe2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:35:50] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2655233 (10Gilles) It looks like temp files might not get cleared up after being created. Unfortunately those temp folders are owner by root and I can't look... [07:36:51] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2655235 (10Gilles) The existence of these temp folders is expected, it's just that the latest update to the thumbor plugins I wrote is probably causing a tem... [07:41:51] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2655241 (10Gilles) OK, I know what's going on. Every file that's downloaded and later hits an error during conversion won't get cleaned up. Resulting in the... 
[07:43:25] 06Operations, 06Performance-Team, 10Thumbor: Temp files not cleaned up on conversion error - https://phabricator.wikimedia.org/T146262#2655244 (10Gilles) [07:49:35] PROBLEM - configured eth on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:49:36] PROBLEM - DPKG on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:49:58] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1002.eqiad.wmnet because of too many down! [07:49:59] PROBLEM - DPKG on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:49:59] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:50:08] PROBLEM - PyBal backends health check on lvs1012 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1002.eqiad.wmnet because of too many down! [07:50:20] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:50:38] PROBLEM - MD RAID on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:50:57] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1002.eqiad.wmnet because of too many down! [07:51:22] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 4.980 second response time [07:51:24] PROBLEM - puppet last run on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:52:03] RECOVERY - configured eth on thumbor1001 is OK: OK - interfaces up [07:52:06] <_joe_> wow [07:52:17] <_joe_> both machines were heavily overloaded [07:52:34] RECOVERY - DPKG on thumbor1002 is OK: All packages OK [07:52:43] taking a look too [07:52:44] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient [07:53:55] RECOVERY - puppet last run on thumbor1001 is OK: OK: Puppet is currently enabled, last run 17 minutes ago with 0 failures [07:54:36] RECOVERY - DPKG on thumbor1001 is OK: All packages OK [07:54:55] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [07:55:35] RECOVERY - MD RAID on thumbor1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [07:55:56] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [07:57:56] 06Operations, 10DBA: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2655281 (10MoritzMuehlenhoff) I noticed T141756 which could be related (since db1082 also has the hardware and the oops looked I/O controller-related) [07:58:25] not sure yet exactly what's up with thumbor dying, cc gilles [08:00:42] (03PS2) 10Alexandros Kosiaris: puppetmaster: servermon report handler [puppet] - 10https://gerrit.wikimedia.org/r/311738 [08:01:00] ah looks deliberate [08:02:07] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: servermon report handler [puppet] - 10https://gerrit.wikimedia.org/r/311738 (owner: 10Alexandros Kosiaris) [08:02:41] <_joe_> akosiaris: should I take a look? [08:02:45] (03PS3) 10Alexandros Kosiaris: puppetmaster: servermon report handler [puppet] - 10https://gerrit.wikimedia.org/r/311738 [08:03:18] 06Operations, 10DBA: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2655283 (10Marostegui) Interesting...feel free to upgrade that firmware if you want. The box isn't pooled yet. 
[08:04:34] (03PS2) 10Elukey: Remove jobrunner01 from deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/311717 (https://phabricator.wikimedia.org/T144006) [08:04:48] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: servermon report handler [puppet] - 10https://gerrit.wikimedia.org/r/311738 (owner: 10Alexandros Kosiaris) [08:06:28] _joe_: yes, lemme make jenkins happy first though. Should take 5 mins [08:06:31] I 'll ping you [08:07:27] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [08:07:45] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 241 bytes in 0.046 second response time [08:07:56] RECOVERY - PyBal backends health check on lvs1012 is OK: PYBAL OK - All pools are healthy [08:08:07] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [08:08:12] I've silenced the paging alert for thumbor btw, in case this happens again [08:08:15] brb [08:09:49] (03CR) 10Elukey: [C: 032] Remove jobrunner01 from deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/311717 (https://phabricator.wikimedia.org/T144006) (owner: 10Elukey) [08:10:45] 06Operations, 10DBA: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2655291 (10Marostegui) I will coordinate with @Cmjohnson to get this upgraded before we repool it back [08:11:59] 06Operations, 10Beta-Cluster-Infrastructure, 07HHVM, 13Patch-For-Review: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2655298 (10elukey) [08:12:57] (03PS4) 10Muehlenhoff: Beta: change deployment-mira02 to deployment-mira [puppet] - 10https://gerrit.wikimedia.org/r/311760 (https://phabricator.wikimedia.org/T144578) (owner: 10Thcipriani) [08:14:54] (03CR) 10Muehlenhoff: [C: 032] Beta: change deployment-mira02 to deployment-mira [puppet] - 10https://gerrit.wikimedia.org/r/311760 (https://phabricator.wikimedia.org/T144578) (owner: 10Thcipriani) [08:16:05] json_decode() error 
(4): Syntax error: @todo more info [08:16:07] ahaha [08:16:10] got it on beta [08:19:45] (03PS4) 10Alexandros Kosiaris: puppetmaster: servermon report handler [puppet] - 10https://gerrit.wikimedia.org/r/311738 [08:20:42] 06Operations, 10ops-eqiad: mw1172 stuck after reboot - https://phabricator.wikimedia.org/T146263#2655305 (10MoritzMuehlenhoff) [08:21:17] _joe_: ^ ready for your review [08:21:32] <_joe_> akosiaris: I actually had an idea about that :P [08:21:45] ACKNOWLEDGEMENT - Host mw1172 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T146263 [08:21:47] (03PS2) 10Volans: db-eqiad.php: Temporarily depool db1086. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311685 (https://phabricator.wikimedia.org/T141951) (owner: 10Marostegui) [08:22:02] _joe_: ? [08:22:33] btw, I am gonna reimage nihal today, is that ok with ya ? [08:22:58] <_joe_> yeah let's do it as soon as possible [08:23:04] <_joe_> I'd like to migrate today [08:23:08] INFO: Unable to find facts for host puppetmaster1002.eqiad.wmnet, skipping [08:23:09] ? [08:23:16] <_joe_> where is that? [08:23:19] PCC [08:23:27] <_joe_> needs refreshing? [08:23:32] ./utils/pcc 311738 palladium.eqiad.wmnet,puppetmaster1002.eqiad.wmnet,puppetmaster1001.eqiad.wmnet,puppetmaster2001.codfw.wmnet,puppetmaster2002.cpdfw.wmnet,rhodium.eqiad.wmnet [08:23:36] even for palladium... 
[08:23:43] INFO: Unable to find facts for host palladium.eqiad.wmnet, skipping [08:23:46] * akosiaris looking [08:23:47] <_joe_> uhm [08:23:57] <_joe_> I might have screwed up permissions earlier [08:23:58] <_joe_> sorry [08:24:06] shame on you [08:24:08] <_joe_> I'll fix [08:24:09] :P [08:24:16] (03CR) 10Volans: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311685 (https://phabricator.wikimedia.org/T141951) (owner: 10Marostegui) [08:25:01] (03PS2) 10Volans: Reimage: minor improvements [puppet] - 10https://gerrit.wikimedia.org/r/311701 (https://phabricator.wikimedia.org/T143536) [08:25:01] <_joe_> akosiaris: try again? [08:25:22] * akosiaris trying [08:25:40] <_joe_> akosiaris: actually, I did run pcc after I did my potential screwup and it worked well [08:25:45] <_joe_> so I guess what's happening is [08:26:05] it's working now btw [08:26:23] <_joe_> ok, that's not really explicable, but :P [08:26:42] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Temporarily depool db1086. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311685 (https://phabricator.wikimedia.org/T141951) (owner: 10Marostegui) [08:26:54] (03CR) 10Volans: [C: 032] Reimage: minor improvements [puppet] - 10https://gerrit.wikimedia.org/r/311701 (https://phabricator.wikimedia.org/T143536) (owner: 10Volans) [08:27:09] <_joe_> akosiaris: so, my idea is [08:27:12] (03Merged) 10jenkins-bot: db-eqiad.php: Temporarily depool db1086. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311685 (https://phabricator.wikimedia.org/T141951) (owner: 10Marostegui) [08:27:18] <_joe_> instead of using yaml files on the FS [08:27:31] <_joe_> which are not necessarily going to be updated on the master [08:27:38] <_joe_> sorry, the frontend [08:27:43] <_joe_> where reports get compiled [08:27:55] <_joe_> we could use e.g. 
curl -G https://nihal.codfw.wmnet/v3/nodes/palladium.eqiad.wmnet/facts [08:28:33] <_joe_> it returns a json list [08:29:05] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depooling db1086 for an alter table - T141951 (duration: 00m 49s) [08:29:06] T141951: Add local_user_id and global_user_id fields to localuser table in centralauth database - https://phabricator.wikimedia.org/T141951 [08:29:08] PROBLEM - puppet last run on elastic2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:30:55] !log reimagining mw1196-7 to jessie [08:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:31:03] volans --^ [08:31:30] !log schema change on S7 - T141951 [08:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:31:41] elukey: just updated wmf-auto-reimage on neodymium [08:31:46] you'll get the latest one ;) [08:32:02] (03PS4) 10ArielGlenn: jobrunner: log rotate jobchron.log [puppet] - 10https://gerrit.wikimedia.org/r/311750 (https://phabricator.wikimedia.org/T96132) (owner: 10Hashar) [08:32:20] yep I cced you for this reason :) [08:32:23] _joe_: actually look at my change. reports are now no longer sent to the master only [08:32:33] er, s/master/frontend/ [08:33:27] but I am not against making the handler better ofc. 
[08:33:32] (03CR) 10ArielGlenn: [C: 032] jobrunner: log rotate jobchron.log [puppet] - 10https://gerrit.wikimedia.org/r/311750 (https://phabricator.wikimedia.org/T96132) (owner: 10Hashar) [08:35:09] <_joe_> akosiaris: no I am saying just the frontend handles the reports [08:35:34] <_joe_> if you look at the puppet::web_frontend template [08:36:23] 06Operations, 10ops-eqiad: Broken disk on copper - https://phabricator.wikimedia.org/T144261#2655345 (10fgiunchedi) I've announced the reimage of copper on ops@ for the October 3rd week. Reimage because of SSDs to be installed in {T130759}, though if there is a spare spinning disk to be installed in the mean t... [08:38:02] _joe_: that's what I am saying. look at https://gerrit.wikimedia.org/r/#/c/311738/4/modules/puppetmaster/templates/web-frontend.conf.erb. I am removing that [08:38:51] it used to be like that because reports = store, but we never did anything with that, so I am just setting it to puppetdb,servermon so any puppetmaster can now handle reports [08:39:28] <_joe_> oh I missed that [08:40:22] <_joe_> akosiaris: this patch can only be merged once we've migrated to puppetdb, though [08:41:49] why ? [08:41:53] it won't hurt I mean [08:42:09] we will just get some reports on the backends [08:42:17] but yes, it should probably be done in tandem [08:42:18] 06Operations, 06Performance-Team, 10Thumbor: Figure out a way to live-debug running production thumbor processes - https://phabricator.wikimedia.org/T146143#2655353 (10fgiunchedi) thanks @ori! yeah manhole seems like a good option, I don't see it packaged for Debian so we'll need to find a way to get it to t... 
[08:42:28] anyway, off to reimaging nihal for now [08:42:35] <_joe_> you have the handler "puppetdb" [08:42:40] <_joe_> anyways, yes, please :) [08:42:54] dependent on $::use_puppetdb though [08:45:02] (03CR) 10Giuseppe Lavagetto: [C: 04-1] puppetmaster: servermon report handler (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/311738 (owner: 10Alexandros Kosiaris) [08:45:27] (03CR) 10Giuseppe Lavagetto: "This is simple enough and I like it, but there are a few corner cases to consider" [puppet] - 10https://gerrit.wikimedia.org/r/311738 (owner: 10Alexandros Kosiaris) [08:45:40] <_joe_> akosiaris: if you want, I can amend [08:46:42] (03PS1) 10Hashar: Drop mira.deployment-prep.eqiad.wmflabs [puppet] - 10https://gerrit.wikimedia.org/r/311939 [08:51:29] (03CR) 10Muehlenhoff: [C: 032] Drop mira.deployment-prep.eqiad.wmflabs [puppet] - 10https://gerrit.wikimedia.org/r/311939 (owner: 10Hashar) [08:53:30] (03CR) 10Alexandros Kosiaris: puppetmaster: servermon report handler (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/311738 (owner: 10Alexandros Kosiaris) [08:54:36] RECOVERY - puppet last run on elastic2005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [08:55:07] (03PS1) 10Volans: Reimage: fix import [puppet] - 10https://gerrit.wikimedia.org/r/311940 (https://phabricator.wikimedia.org/T143536) [08:56:55] (03CR) 10Volans: [C: 032] Reimage: fix import [puppet] - 10https://gerrit.wikimedia.org/r/311940 (https://phabricator.wikimedia.org/T143536) (owner: 10Volans) [09:00:53] 06Operations, 10DBA: db1019: Decommission - https://phabricator.wikimedia.org/T146265#2655380 (10Marostegui) [09:02:17] akosiaris: FYI I'm going to reimage bast3001 shortly, I see you are using it [09:05:26] er... 
hmm [09:05:28] ok logging out [09:09:07] !log reimage bast3001.wikimedia.org with separate /srv [09:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:12:37] PROBLEM - Host elastic1027 is DOWN: PING CRITICAL - Packet loss = 100% [09:13:48] <_joe_> uhm [09:13:59] <_joe_> gehel: know anything about ^^ ? [09:14:32] _joe_: restart in progress, it seems that 1027 is taking more time than expected to reboot. Checking [09:17:40] (03CR) 10Ema: base: add run-no-puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/311671 (owner: 10Ema) [09:18:28] (03PS1) 10Marostegui: db-eqiad.php: Repool db1086 after the ALTER table [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311942 (https://phabricator.wikimedia.org/T141951) [09:19:44] !log powercycling elastic1027 - T145404 [09:19:45] T145404: Upgrade elasticsearch and plugins to 2.3.5 - https://phabricator.wikimedia.org/T145404 [09:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:20:59] 06Operations, 06Labs: Good bug reports - https://phabricator.wikimedia.org/T146266#2655420 (10Jishnugopim) [09:22:20] _joe_: powercycling elastic1027 failed (The RAC is unable to communicate with the BMC). Not really sure what the options are at this point... [09:22:35] <_joe_> racadm racreset [09:22:37] <_joe_> and try again [09:22:41] <_joe_> the step after that [09:22:43] _joe_: thanks! [09:23:06] <_joe_> is escalating to chris once he's in the dc [09:23:25] :( only so much we can do remotely... [09:23:40] <_joe_> if it's critical [09:23:46] <_joe_> we have smarthands support [09:23:52] <_joe_> but I'd avoid it if possible [09:24:06] <_joe_> also, would you prefer to work from a datacenter? ;) [09:24:07] PROBLEM - puppet last run on pc2005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:24:16] we have enough redundancy, losing one elasticsearch node is not really an issue [09:24:24] <_joe_> exactly [09:24:47] <_joe_> that's why I suggested to wait for chris if racreset doesn't work [09:25:35] _joe_: thanks! I'll be in touch with chris [09:26:00] <_joe_> and, at least we have someone who knows what he's doing in the datacenter; at $JOB~1 we only had smarthands in the dc [09:26:17] <_joe_> and once, they powercycled a server in the wrong row [09:26:19] !log Stopping mysql at db1019 for a few days as it will be decommissioned - T146265 [09:26:20] T146265: db1019: Decommission - https://phabricator.wikimedia.org/T146265 [09:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:26:24] <_joe_> same rack/position, different row [09:26:33] <_joe_> which happened to be the active db master [09:26:34] <_joe_> :P [09:26:47] I can see why that would be an issue :) [09:28:07] 06Operations, 10DBA: db1019: Decommission - https://phabricator.wikimedia.org/T146265#2655439 (10Marostegui) This server was doing nothing but replicating, so mysql has been stopped: ``` MariaDB PRODUCTION s4 localhost (none) > show processlist; +-----------+-----------------+------------------+--------------... [09:33:21] (03PS4) 10Ema: base: add run-no-puppet [puppet] - 10https://gerrit.wikimedia.org/r/311671 [09:35:48] (03PS7) 10Elukey: Improve resilience during varnish (re)starts [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/311415 (https://phabricator.wikimedia.org/T138747) [09:38:57] (03CR) 10Elukey: [C: 032 V: 032] "The last version was only a s/aquired/acquired/ in the log strings." 
[software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/311415 (https://phabricator.wikimedia.org/T138747) (owner: 10Elukey) [09:39:52] !log reimaging mw1173-mw1175 to jessie [09:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:40:05] (03PS5) 10Ema: base: add run-no-puppet [puppet] - 10https://gerrit.wikimedia.org/r/311671 [09:40:17] (03PS4) 10ArielGlenn: jobrunner: refactor rsyslog conf and let wikidev read log [puppet] - 10https://gerrit.wikimedia.org/r/311719 (https://phabricator.wikimedia.org/T146040) (owner: 10Hashar) [09:41:42] (03CR) 10ArielGlenn: [C: 032] jobrunner: refactor rsyslog conf and let wikidev read log [puppet] - 10https://gerrit.wikimedia.org/r/311719 (https://phabricator.wikimedia.org/T146040) (owner: 10Hashar) [09:41:44] (03PS1) 10Hashar: beta: add hiera deployment_server var from wikitech [puppet] - 10https://gerrit.wikimedia.org/r/311946 (https://phabricator.wikimedia.org/T144578) [09:41:46] (03PS1) 10Hashar: beta: switch deploy server to deployment-mira [puppet] - 10https://gerrit.wikimedia.org/r/311947 (https://phabricator.wikimedia.org/T144578) [09:44:32] 06Operations: puppet run stopping qrunner on fermium - https://phabricator.wikimedia.org/T144933#2614835 (10Joe) So no one looked into this in the last few days? I am going to need to have puppet running for the puppetdb migration, so looking into this now. 
[09:45:01] (03PS5) 10Alexandros Kosiaris: puppetmaster: servermon report handler [puppet] - 10https://gerrit.wikimedia.org/r/311738 [09:45:39] (03CR) 10Alexandros Kosiaris: puppetmaster: servermon report handler (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/311738 (owner: 10Alexandros Kosiaris) [09:45:56] (03CR) 10Alexandros Kosiaris: [C: 031] Monitor usage of in-memory elasticsearch datastructures [puppet] - 10https://gerrit.wikimedia.org/r/311848 (https://phabricator.wikimedia.org/T144387) (owner: 10EBernhardson) [09:46:13] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: servermon report handler [puppet] - 10https://gerrit.wikimedia.org/r/311738 (owner: 10Alexandros Kosiaris) [09:47:25] (03CR) 10Hashar: "Merely for sake of consistency and have everything defined at the same place (puppet). Cherry picked on beta puppet master" [puppet] - 10https://gerrit.wikimedia.org/r/311946 (https://phabricator.wikimedia.org/T144578) (owner: 10Hashar) [09:49:16] RECOVERY - puppet last run on pc2005 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [09:51:39] (03PS3) 10Ema: varnish: add varnish-fe restart script [puppet] - 10https://gerrit.wikimedia.org/r/311387 [09:51:45] (03CR) 10Ema: [C: 032 V: 032] varnish: add varnish-fe restart script [puppet] - 10https://gerrit.wikimedia.org/r/311387 (owner: 10Ema) [09:53:11] 06Operations, 10ops-eqiad, 06DC-Ops, 06Discovery-Search: elastic1027 does not reboot - https://phabricator.wikimedia.org/T146268#2655478 (10Gehel) [09:56:15] (03CR) 10Muehlenhoff: [C: 031] beta: add hiera deployment_server var from wikitech [puppet] - 10https://gerrit.wikimedia.org/r/311946 (https://phabricator.wikimedia.org/T144578) (owner: 10Hashar) [09:57:59] (03PS6) 10Alexandros Kosiaris: puppetmaster: servermon report handler [puppet] - 10https://gerrit.wikimedia.org/r/311738 [10:01:00] 06Operations: puppet run stopping qrunner on fermium - https://phabricator.wikimedia.org/T144933#2655490 (10Joe) The issue here is that 
`debconf::set` is very, very primitive. The issue was that the order of languages wasn't in the same order in debconf and in puppet: ``` root@fermium:~# echo get mailman/site_l... [10:01:04] 06Operations, 10MediaWiki-JobRunner, 07Beta-Cluster-reproducible, 13Patch-For-Review: wikidev people cant read /var/log/mediawiki/jobrunner.log - https://phabricator.wikimedia.org/T146040#2655491 (10hashar) a:03hashar **status** Trusty hosts are not impacted, the files are created via upstart redirectin... [10:02:03] (03PS1) 10Giuseppe Lavagetto: mailman::listserve: reproduce the debconf order from fermium [puppet] - 10https://gerrit.wikimedia.org/r/311950 (https://phabricator.wikimedia.org/T144933) [10:02:51] PROBLEM - puppet last run on ms-be1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:03:58] (03CR) 10Giuseppe Lavagetto: [C: 032] mailman::listserve: reproduce the debconf order from fermium [puppet] - 10https://gerrit.wikimedia.org/r/311950 (https://phabricator.wikimedia.org/T144933) (owner: 10Giuseppe Lavagetto) [10:04:06] (03PS2) 10Giuseppe Lavagetto: mailman::listserve: reproduce the debconf order from fermium [puppet] - 10https://gerrit.wikimedia.org/r/311950 (https://phabricator.wikimedia.org/T144933) [10:04:08] (03CR) 10Giuseppe Lavagetto: [V: 032] mailman::listserve: reproduce the debconf order from fermium [puppet] - 10https://gerrit.wikimedia.org/r/311950 (https://phabricator.wikimedia.org/T144933) (owner: 10Giuseppe Lavagetto) [10:06:14] !log rebooting lithium for kernel security update [10:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:06:23] (03PS6) 10Ema: base: add run-no-puppet [puppet] - 10https://gerrit.wikimedia.org/r/311671 [10:06:24] sigh, bast3001 after being asked to pxe boot just hangs there after POST [10:07:34] 06Operations, 13Patch-For-Review: puppet run stopping qrunner on fermium - https://phabricator.wikimedia.org/T144933#2655514 (10Joe) Now 
puppet runs fine on fermium and doesn't stop/start qrunner at each iteration, but I'll leave the ticket open because this is in need of some serious reengineering. [10:07:38] (03PS1) 10Elukey: Merge branch 'master' into debian [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/311951 [10:07:40] (03PS1) 10Elukey: Package last upstream 1.0.12-1 [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/311952 [10:07:49] PROBLEM - puppet last run on nihal is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 8 minutes ago with 3 failures. Failed resources (up to 3 shown): File[/etc/postgresql/9.4/main/tuning.conf],File[/etc/postgresql/9.4/main/postgresql.conf],File[/etc/postgresql/ssl] [10:07:59] PROBLEM - salt-minion processes on nihal is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [10:08:53] 06Operations, 10Phabricator: phabricator: can't search for RT tickets (reference field) anymore - https://phabricator.wikimedia.org/T146116#2655516 (10Aklapper) Could reproduce locally; created https://secure.phabricator.com/T11675 [10:09:14] 06Operations, 10Phabricator (Upstream), 07Upstream: phabricator: can't search for RT tickets (reference field) anymore - https://phabricator.wikimedia.org/T146116#2655517 (10Aklapper) p:05Triage>03Low [10:09:26] godog: hardware... 
there's a few unused former cp hosts in esams, maybe we can repurpose one of those temporarily [10:09:43] (03Abandoned) 10Elukey: Merge branch 'master' into debian [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/311951 (owner: 10Elukey) [10:09:47] (03Abandoned) 10Elukey: Package last upstream 1.0.12-1 [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/311952 (owner: 10Elukey) [10:10:49] (03PS2) 10Muehlenhoff: beta: add hiera deployment_server var from wikitech [puppet] - 10https://gerrit.wikimedia.org/r/311946 (https://phabricator.wikimedia.org/T144578) (owner: 10Hashar) [10:12:22] moritzm: hehe I'll take a closer look, it might be just slow tftp across the pacific heh [10:12:50] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/311738 (owner: 10Alexandros Kosiaris) [10:16:13] 06Operations, 10Phabricator (Upstream), 07Upstream: phabricator: can't search for RT tickets (reference field) anymore - https://phabricator.wikimedia.org/T146116#2655525 (10Paladox) Thank you. 
[10:20:54] (03CR) 10Muehlenhoff: [C: 032] beta: add hiera deployment_server var from wikitech [puppet] - 10https://gerrit.wikimedia.org/r/311946 (https://phabricator.wikimedia.org/T144578) (owner: 10Hashar) [10:28:39] RECOVERY - salt-minion processes on nihal is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:32:27] (03PS1) 10Hashar: contint: labs instance all have /dev/vdb [puppet] - 10https://gerrit.wikimedia.org/r/311954 [10:33:14] RECOVERY - puppet last run on ms-be1008 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:37:12] PROBLEM - mediawiki-installation DSH group on mw1175 is CRITICAL: Host mw1175 is not in mediawiki-installation dsh group [10:37:22] PROBLEM - mediawiki-installation DSH group on mw1174 is CRITICAL: Host mw1174 is not in mediawiki-installation dsh group [10:37:41] PROBLEM - mediawiki-installation DSH group on mw1173 is CRITICAL: Host mw1173 is not in mediawiki-installation dsh group [10:38:00] (03CR) 10Volans: [C: 031] "Awesome! Thanks for accepting all suggestions" [puppet] - 10https://gerrit.wikimedia.org/r/311671 (owner: 10Ema) [10:38:02] PROBLEM - puppet last run on mw1175 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 6 minutes ago with 8 failures. Failed resources (up to 3 shown): File[/etc/apache2/mods-available/setenvif.conf],File[/etc/apache2/mods-available/userdir.conf],File[/etc/apache2/mods-available/autoindex.conf],Package[fonts-noto-cjk] [10:38:21] PROBLEM - salt-minion processes on mw1175 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [10:38:56] <_joe_> elukey, moritzm can you stop with the reimagings for the remainder of the week? 
[10:39:13] PROBLEM - Apache HTTP on mw1174 is CRITICAL: Connection refused [10:39:33] PROBLEM - Apache HTTP on mw1173 is CRITICAL: Connection refused [10:40:02] _joe_ sure, I am finishing the last two api and then I'll stop [10:40:42] RECOVERY - puppet last run on nihal is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:41:13] PROBLEM - puppet last run on ms-be1021 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 23 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[parted-/dev/sdh] [10:41:25] (03PS7) 10Ema: base: add run-no-puppet [puppet] - 10https://gerrit.wikimedia.org/r/311671 [10:41:34] (03CR) 10Ema: [C: 032 V: 032] base: add run-no-puppet [puppet] - 10https://gerrit.wikimedia.org/r/311671 (owner: 10Ema) [10:41:51] RECOVERY - Apache HTTP on mw1174 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.065 second response time [10:42:03] RECOVERY - Apache HTTP on mw1173 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.095 second response time [10:45:26] !log adding mw1196 back to serving live traffic after the reimage [10:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:47:42] PROBLEM - puppet last run on mc2009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/run-no-puppet] [10:49:01] (03PS1) 10Mobrovac: RESTBase config: Add Swagger UI header info [puppet] - 10https://gerrit.wikimedia.org/r/311958 [10:49:06] (03PS1) 10Hashar: contint: migrate slaves to /srv [puppet] - 10https://gerrit.wikimedia.org/r/311959 [10:49:41] (03CR) 10Hashar: [C: 04-1] "Merely for beta cluster. For CI slaves a bunch of other manifests have to be adjusted." 
[puppet] - 10https://gerrit.wikimedia.org/r/311959 (owner: 10Hashar) [10:49:42] PROBLEM - Varnishkafka log producer on cp1048 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [10:50:23] buuu [10:50:32] PROBLEM - puppet last run on lvs1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/run-no-puppet] [10:51:45] !log restarted varnishkafka on cp1048 (VSLQ_Dispatch: Varnish Log abandoned or overrun.) [10:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:52:21] RECOVERY - Varnishkafka log producer on cp1048 is OK: PROCS OK: 1 process with command name varnishkafka [10:53:36] (03PS2) 10Hashar: contint: migrate slaves to /srv [puppet] - 10https://gerrit.wikimedia.org/r/311959 [10:53:52] RECOVERY - salt-minion processes on mw1175 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:57:37] (03PS1) 10Alexandros Kosiaris: puppetdb: Have postgres users deployed on slave as well [puppet] - 10https://gerrit.wikimedia.org/r/311962 [11:00:12] PROBLEM - check_mysql on frdb1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1208 [11:00:29] (03CR) 10Alexandros Kosiaris: [C: 032] puppetdb: Have postgres users deployed on slave as well [puppet] - 10https://gerrit.wikimedia.org/r/311962 (owner: 10Alexandros Kosiaris) [11:00:33] (03PS2) 10Alexandros Kosiaris: puppetdb: Have postgres users deployed on slave as well [puppet] - 10https://gerrit.wikimedia.org/r/311962 [11:00:35] (03CR) 10Alexandros Kosiaris: [V: 032] puppetdb: Have postgres users deployed on slave as well [puppet] - 10https://gerrit.wikimedia.org/r/311962 (owner: 10Alexandros Kosiaris) [11:03:33] !log adding mw1197 back to serving live traffic after the reimage [11:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:03:41] PROBLEM - Host bast3001 is DOWN: 
PING CRITICAL - Packet loss = 100% [11:04:07] !log Rebuilding tables in db1082 (non pooled) - T137191 [11:04:08] T137191: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191 [11:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:05:11] RECOVERY - check_mysql on frdb1001 is OK: Uptime: 513382 Threads: 1 Questions: 111497767 Slow queries: 4015 Opens: 4333 Flush tables: 1 Open tables: 569 Queries per second avg: 217.182 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [11:07:35] !log restbase deploy start of ca55669 [11:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:08:13] PROBLEM - puppet last run on nihal is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 4 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[pass_set-puppetdb@nihal-v4],Exec[pass_set-replication@nihal-v4],Exec[pass_set-puppetdb@localhost] [11:09:03] <_joe_> akosiaris: nihal seems ok now, though [11:10:31] RECOVERY - Host bast3001 is UP: PING OK - Packet loss = 0%, RTA = 84.79 ms [11:11:30] _joe_: yes it is, damn puppet... 
[11:11:34] needs some more fixing [11:12:20] <_joe_> ahah yes I just saw [11:12:26] <_joe_> but it's ok to use now, though [11:12:31] <_joe_> thanks :)) [11:12:32] RECOVERY - puppet last run on mc2009 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [11:13:13] RECOVERY - puppet last run on lvs1007 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [11:13:14] (03CR) 10Volans: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311942 (https://phabricator.wikimedia.org/T141951) (owner: 10Marostegui) [11:13:49] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1086 after the ALTER table [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311942 (https://phabricator.wikimedia.org/T141951) (owner: 10Marostegui) [11:14:17] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1086 after the ALTER table [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311942 (https://phabricator.wikimedia.org/T141951) (owner: 10Marostegui) [11:16:27] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1086 after the ALTER table - T141951 (duration: 00m 47s) [11:16:28] T141951: Add local_user_id and global_user_id fields to localuser table in centralauth database - https://phabricator.wikimedia.org/T141951 [11:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:17:11] PROBLEM - puppet last run on bast3001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Service[ganglia-monitor] [11:17:53] RECOVERY - puppet last run on ms-be1021 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [11:18:33] RECOVERY - puppet last run on ms-be1022 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:21:02] PROBLEM - puppet last run on helium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:24:00] (03PS1) 10Elukey: Improve resilience during varnish (re)starts [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/311965 (https://phabricator.wikimedia.org/T138747) [11:25:12] PROBLEM - puppet last run on carbon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:25:42] (03PS1) 10Elukey: Package last upstream 1.0.12-1 [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/311967 [11:25:45] !log restbase deploy end of ca55669 [11:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:26:22] RECOVERY - puppet last run on nihal is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:26:32] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:27:19] <_joe_> arg the precise errors are mine [11:27:28] <_joe_> ignore those [11:28:51] RECOVERY - puppet last run on ms-be3003 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [11:29:03] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:29:03] PROBLEM - puppet last run on labsdb1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:30:04] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:30:32] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:32:44] PROBLEM - puppet last run on labstore1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:32:52] RECOVERY - puppet last run on carbon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:33:26] PROBLEM - puppet last run on potassium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:34:01] (03Abandoned) 10Cenarium: Move account creation throttle to ping limiter and remove noratelimit from account creators [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266454 (https://phabricator.wikimedia.org/T85538) (owner: 10Cenarium) [11:34:01] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:34:02] PROBLEM - puppet last run on nihal is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 5 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[pass_set-puppetdb@nihal-v4],Exec[pass_set-replication@nihal-v4],Exec[pass_set-puppetdb@localhost] [11:34:13] PROBLEM - puppet last run on cp3035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:34:25] (03CR) 10Cenarium: "... due to throttle being moved to AuthManager" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266454 (https://phabricator.wikimedia.org/T85538) (owner: 10Cenarium) [11:34:33] PROBLEM - puppet last run on fluorine is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:35:29] 06Operations: ganglia-monitor and puppet failing on bast3001 - https://phabricator.wikimedia.org/T144778#2655673 (10fgiunchedi) 05Resolved>03Open This is back after a bast3001 reboot, doesn't look like it can come back clean only after a reboot. Looks like the easiest might be to provide a native systemd ser... 
[11:38:41] RECOVERY - mediawiki-installation DSH group on mw1175 is OK: OK [11:38:54] RECOVERY - mediawiki-installation DSH group on mw1174 is OK: OK [11:39:12] RECOVERY - mediawiki-installation DSH group on mw1173 is OK: OK [11:40:32] RECOVERY - puppet last run on labstore1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:40:42] RECOVERY - puppet last run on labsdb1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:41:53] RECOVERY - puppet last run on cp3035 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:42:02] PROBLEM - puppet last run on wtp2012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:42:13] 06Operations: ganglia-monitor and puppet failing on bast3001 - https://phabricator.wikimedia.org/T144778#2655682 (10MoritzMuehlenhoff) Agreed, the service should start reliably without manual intervention and providing a native ganglia-monitor service unit is straightforward [11:43:32] RECOVERY - puppet last run on potassium is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [11:44:12] RECOVERY - puppet last run on helium is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [11:44:53] RECOVERY - puppet last run on fluorine is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:45:44] !log rolling restart of trusty swift backend servers in codfw for kernel security update [11:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:47:02] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:47:12] RECOVERY - puppet last run on wtp2012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:48:21] PROBLEM - puppet last run on carbon is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:49:23] (03PS1) 10Marostegui: db-eqiad.php: Temporarily depool db1094 for ALTER [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311968 (https://phabricator.wikimedia.org/T141951) [11:49:31] RECOVERY - puppet last run on nihal is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [11:50:12] PROBLEM - puppet last run on labsdb1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:51:02] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:51:31] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:53:51] PROBLEM - puppet last run on ms-be1021 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[parted-/dev/sdh] [11:54:04] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:54:58] (03PS1) 10Alexandros Kosiaris: postgres: Allow to not set password for users if not on master [puppet] - 10https://gerrit.wikimedia.org/r/311969 [11:55:04] PROBLEM - puppet last run on fluorine is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:55:52] PROBLEM - puppet last run on labstore1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:56:21] PROBLEM - puppet last run on potassium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:57:11] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [11:57:42] RECOVERY - puppet last run on labsdb1007 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [12:02:16] (03CR) 10Muehlenhoff: Create a new LDAP schema extension for custom user attributes (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/311694 (https://phabricator.wikimedia.org/T146102) (owner: 10Muehlenhoff) [12:02:31] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:03:41] (03PS2) 10Muehlenhoff: Create a new LDAP schema extension for custom user attributes [puppet] - 10https://gerrit.wikimedia.org/r/311694 (https://phabricator.wikimedia.org/T146102) [12:03:54] RECOVERY - puppet last run on carbon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:06:40] (03CR) 10Volans: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311968 (https://phabricator.wikimedia.org/T141951) (owner: 10Marostegui) [12:07:00] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Temporarily depool db1094 for ALTER [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311968 (https://phabricator.wikimedia.org/T141951) (owner: 10Marostegui) [12:07:26] (03Merged) 10jenkins-bot: db-eqiad.php: Temporarily depool db1094 for ALTER [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311968 (https://phabricator.wikimedia.org/T141951) (owner: 10Marostegui) [12:08:24] (03CR) 10Alexandros Kosiaris: [C: 031] Create a new LDAP schema extension for custom user attributes [puppet] - 10https://gerrit.wikimedia.org/r/311694 (https://phabricator.wikimedia.org/T146102) (owner: 10Muehlenhoff) [12:09:20] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1094 for an ALTER table - T141951 (duration: 00m 47s) [12:09:21] T141951: Add local_user_id and 
global_user_id fields to localuser table in centralauth database - https://phabricator.wikimedia.org/T141951 [12:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:09:44] (03PS1) 10Filippo Giunchedi: ganglia: ship native systemd service unit [puppet] - 10https://gerrit.wikimedia.org/r/311970 (https://phabricator.wikimedia.org/T144778) [12:11:21] RECOVERY - puppet last run on labstore1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:11:31] (03CR) 10Filippo Giunchedi: "tested on bast3001" [puppet] - 10https://gerrit.wikimedia.org/r/311970 (https://phabricator.wikimedia.org/T144778) (owner: 10Filippo Giunchedi) [12:11:32] RECOVERY - puppet last run on labsdb1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:12:41] PROBLEM - puppet last run on nihal is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 1 minute ago with 3 failures. Failed resources (up to 3 shown): Exec[pass_set-puppetdb@nihal-v4],Exec[pass_set-replication@nihal-v4],Exec[pass_set-puppetdb@localhost] [12:14:21] RECOVERY - puppet last run on potassium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:15:42] RECOVERY - puppet last run on fluorine is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:16:22] RECOVERY - puppet last run on bast3001 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [12:17:43] RECOVERY - puppet last run on labsdb1005 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [12:18:51] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:19:45] PROBLEM - puppet last run on db1050 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:21:19] (03PS1) 10Marostegui: db-eqiad.php: Repool db1094 after the ALTER [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311971 (https://phabricator.wikimedia.org/T141951) [12:24:21] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [12:24:42] (03CR) 10Volans: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311971 (https://phabricator.wikimedia.org/T141951) (owner: 10Marostegui) [12:27:01] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4316725 keys - replication_delay is 0 [12:28:18] jouncebot, next [12:28:18] In 0 hour(s) and 31 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160921T1300) [12:28:40] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1094 after the ALTER [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311971 (https://phabricator.wikimedia.org/T141951) (owner: 10Marostegui) [12:29:07] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1094 after the ALTER [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311971 (https://phabricator.wikimedia.org/T141951) (owner: 10Marostegui) [12:29:52] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me. Maybe also submit as a bug in Debian? The current ganglia package in unstable only ships a sysvinit script." 
[puppet] - 10https://gerrit.wikimedia.org/r/311970 (https://phabricator.wikimedia.org/T144778) (owner: 10Filippo Giunchedi) [12:30:14] (03CR) 10Alexandros Kosiaris: [C: 032] postgres: Allow to not set password for users if not on master [puppet] - 10https://gerrit.wikimedia.org/r/311969 (owner: 10Alexandros Kosiaris) [12:30:17] (03PS2) 10Alexandros Kosiaris: postgres: Allow to not set password for users if not on master [puppet] - 10https://gerrit.wikimedia.org/r/311969 [12:30:20] (03CR) 10Alexandros Kosiaris: [V: 032] postgres: Allow to not set password for users if not on master [puppet] - 10https://gerrit.wikimedia.org/r/311969 (owner: 10Alexandros Kosiaris) [12:30:35] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1094 after the ALTER table - T141951 (duration: 00m 47s) [12:30:36] T141951: Add local_user_id and global_user_id fields to localuser table in centralauth database - https://phabricator.wikimedia.org/T141951 [12:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:34:52] (03PS1) 10Alexandros Kosiaris: postgres::user: Move dependency to pass_set exec [puppet] - 10https://gerrit.wikimedia.org/r/311972 [12:36:06] (03CR) 10jenkins-bot: [V: 04-1] postgres::user: Move dependency to pass_set exec [puppet] - 10https://gerrit.wikimedia.org/r/311972 (owner: 10Alexandros Kosiaris) [12:38:19] PROBLEM - puppet last run on cp3015 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:43:56] (03PS2) 10Alexandros Kosiaris: postgres::user: Move dependency to pass_set exec [puppet] - 10https://gerrit.wikimedia.org/r/311972 [12:45:44] RECOVERY - puppet last run on db1050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:48:11] (03Abandoned) 10Aude: Update aude's ssh key [puppet] - 10https://gerrit.wikimedia.org/r/311393 (owner: 10Aude) [12:49:24] (03CR) 10Alexandros Kosiaris: [C: 032] postgres::user: Move dependency to pass_set exec [puppet] - 10https://gerrit.wikimedia.org/r/311972 (owner: 10Alexandros Kosiaris) [12:49:28] (03PS3) 10Alexandros Kosiaris: postgres::user: Move dependency to pass_set exec [puppet] - 10https://gerrit.wikimedia.org/r/311972 [12:49:31] (03CR) 10Alexandros Kosiaris: [V: 032] postgres::user: Move dependency to pass_set exec [puppet] - 10https://gerrit.wikimedia.org/r/311972 (owner: 10Alexandros Kosiaris) [12:49:52] (03PS1) 10Yurik: Add new table params for geoshape service [puppet] - 10https://gerrit.wikimedia.org/r/311976 [12:50:27] (03PS2) 10Gehel: Add new table params for geoshape service [puppet] - 10https://gerrit.wikimedia.org/r/311976 (owner: 10Yurik) [12:51:57] (03CR) 10Gehel: [C: 032] Add new table params for geoshape service [puppet] - 10https://gerrit.wikimedia.org/r/311976 (owner: 10Yurik) [12:53:22] akosiaris: I bumped into your change on postgresql::user during puppet-merge. Should I merge it as well? [12:53:33] akosiaris: it looks trivial enough to me... [12:53:43] yes [12:53:52] akosiaris: thanks, will do! [12:59:12] RECOVERY - puppet last run on nihal is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:00:04] hashar, Dereckson, addshore, and aude: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160921T1300). Please do the needful. 
[13:00:04] Krenair, aude, phuedx, and yurik: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:12] (03CR) 10Elukey: [C: 032 V: 032] Improve resilience during varnish (re)starts [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/311965 (https://phabricator.wikimedia.org/T138747) (owner: 10Elukey) [13:00:35] (03CR) 10Elukey: [C: 032 V: 032] Package last upstream 1.0.12-1 [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/311967 (owner: 10Elukey) [13:01:09] ah snap gerrit tricked me [13:01:42] RECOVERY - puppet last run on cp3015 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [13:02:36] I put -r for debian branch but I can see only the topic [13:02:37] grrr [13:02:57] here [13:03:50] o/ [13:03:50] here [13:04:01] * aude waves [13:04:21] o/ [13:04:29] hi [13:04:37] i'm a little confused about the status of the deployment branches [13:04:43] we have wmf20 now? [13:04:51] aude, everyone is ;) [13:04:59] aude, are you in Brussels? [13:05:04] yurik: tonight [13:05:05] (03PS2) 10Hashar: New wikitext editor: Enable the Beta Feature in Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311877 (owner: 10Jforrester) [13:05:10] awesome, i'm already here [13:05:12] tomorrow is the HOT summit [13:05:13] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311877 (owner: 10Jforrester) [13:05:24] aude, i was thinking of going to FOSS4G [13:05:32] is HOT worth visiting? 
[13:05:34] doing them in order [13:05:39] (03Merged) 10jenkins-bot: New wikitext editor: Enable the Beta Feature in Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311877 (owner: 10Jforrester) [13:05:48] yurik: you get patch for wmf20 and wmf19 [13:05:48] yurik: i'm a member of HOT, so yes for me at least [13:05:49] hashar, random is an order :) [13:06:01] yurik: prod runs on wmf.18 for now and wmf.20 has not been cut!?! [13:06:18] hashar: my patch should apply to wmf18 + wmf19 (in case it gets deployed again) + wmf20 [13:06:32] hashar, just like aude, i am highly confused - i looked at the train schedule - it said 19->20 :) [13:06:46] oh my [13:06:46] we have cut wmf.20 :D [13:06:56] can you CR+2 the patches please ? [13:07:11] yeh! and group0 is going to wmf19 tommorrow I think! [13:07:13] my point exactly :) do you want 19 & 20, or 18 as well? [13:07:16] (or maybe today).... [13:07:40] hashar, pls +2 the ones you think are right, i am very confused by the train status [13:07:41] hashar: you are doing swat today? [13:07:43] the combination of wikibase and mediawiki core in wmf20 potentially breaks things [13:07:58] likely (hence what my thing in swat helps prevent) [13:07:59] Krenair: 311877 New wikitext editor: Enable the Beta Feature in Beta Cluster is on mw1099 [13:08:04] this will be a fun week [13:08:13] plus ops heading out [13:08:32] (03PS1) 10Elukey: Revert "Improve resilience during varnish (re)starts" [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/311978 [13:08:33] yurik: i've scheduled https://gerrit.wikimedia.org/r/#/c/311398/ to be deployed now fyi [13:08:35] aude cr+2 ed your change [13:08:39] (also, hey) [13:08:42] thanks [13:08:44] hashar, okay. it doesn't do anything in prod [13:08:48] Hello. yurik: you can find https://wikitech.wikimedia.org/wiki/Deployments/Holding_the_train#What_happens_in_SWAT_while_the_train_is_on_hold.3F useful [13:09:31] Dereckson, thanks! 
I was looking at the deployment page - the trains still have the old numbers, that's why its confusing [13:10:02] hashar, you can push it out to all the other servers now [13:10:15] !log hashar@tin Synchronized wmf-config: New wikitext editor: Enable the Beta Feature in Beta Cluster (duration: 00m 51s) [13:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:10:21] yurik: we've a dashboard: https://tools.wmflabs.org/versions/ [13:10:31] oh, nice!!! [13:10:38] thanks, haven't seen that one yet [13:10:48] yurik: so for your patch, isn't it needed on wmf.18 which we currently use ? [13:10:52] Dereckson: ZOMG!!1 [13:10:58] hashar, yes [13:11:03] * phuedx bookmarks [13:11:09] yurik: so need yet another cherry pick :) [13:12:34] hashar, https://gerrit.wikimedia.org/r/#/c/311980/ [13:12:46] I should have tested the patch [13:12:46] (03CR) 10Elukey: [C: 032 V: 032] Revert "Improve resilience during varnish (re)starts" [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/311978 (owner: 10Elukey) [13:12:46] damn [13:12:54] Notice: Undefined variable: wmgVisualEditorEnableWikitext in /srv/mediawiki/wmf-config/CommonSettings.php on line 2157 [13:13:05] :( [13:13:16] a single call to mw1099 would have caught it I guess [13:13:27] !log hashar@tin Synchronized wmf-config: New wikitext editor: Enable the Beta Feature in Beta Cluster (duration: 00m 50s) [13:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:13:37] Dereckson, i thought that we moved zerowiki to group1? [13:13:50] (03PS1) 10Hashar: Revert "New wikitext editor: Enable the Beta Feature in Beta Cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311981 [13:13:59] you synchronised commonsettings before initialisesettings or something? 
[13:14:27] I don't see a problem with the patch [13:14:57] (03CR) 10Hashar: "Reverted via https://gerrit.wikimedia.org/r/#/c/311981/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311877 (owner: 10Jforrester) [13:15:10] (03CR) 10Hashar: [C: 032] Revert "New wikitext editor: Enable the Beta Feature in Beta Cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311981 (owner: 10Hashar) [13:15:26] hashar, stop [13:15:34] I already reverted on the cluster [13:15:48] (03Merged) 10jenkins-bot: Revert "New wikitext editor: Enable the Beta Feature in Beta Cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311981 (owner: 10Hashar) [13:15:49] Why? [13:16:48] hashar... [13:16:49] cause my terminal instantly exploded [13:16:52] so will try again after [13:16:58] panic [13:16:59] !log hashar@tin Synchronized wmf-config: (no message) (duration: 00m 49s) [13:17:00] revert [13:17:03] which log did the warnings come up in? [13:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:17:16] the process is: [13:17:16] A) notice error [13:17:20] B) REVERT DEPLOY NOW [13:17:33] C) send to revert patch to gerrit / rebase tin / deploy revert dummy [13:17:35] D) think [13:17:56] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002 / webrequest logs for MelodyKramer - https://phabricator.wikimedia.org/T145387#2655815 (10MelodyKramer) Thank you @Dzahn! [13:18:13] (03PS2) 10Hashar: Zero: Make remote config explicit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311398 (https://phabricator.wikimedia.org/T145227) (owner: 10Phuedx) [13:18:25] phuedx: going to push your change. Can it be tested on mw1099? [13:18:34] Why are you moving on? 
[13:18:37] You haven't done mine yet [13:18:46] your broke [13:18:49] so will revisit after [13:18:57] hashar: no, unfortunately not -- wmgZeroBanner is only truthy on enwiki [13:19:15] i believe it's a noop change and yurik can confirm [13:19:19] phuedx: well you can surely visit en.wikipedia.org while pointing on mw1099 ? :) [13:19:21] okkk [13:19:24] no [13:19:27] yep [13:19:31] you should've sync-file'd InitialiseSettings [13:20:15] (03CR) 10Hashar: [C: 032] Zero: Make remote config explicit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311398 (https://phabricator.wikimedia.org/T145227) (owner: 10Phuedx) [13:20:41] (03Merged) 10jenkins-bot: Zero: Make remote config explicit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311398 (https://phabricator.wikimedia.org/T145227) (owner: 10Phuedx) [13:21:33] hashar: installed the extension and will test on 1099 [13:21:33] phuedx: yurik mobile done [13:21:42] * yurik hides [13:22:10] Krenair: I will redo it dont worry :) [13:22:13] hashar: 1099 should explode if the config is wrong, as in //explode// with a jsonconfig exception [13:22:59] phuedx: looks like it is all happy [13:23:08] hashar: agreed [13:23:27] yurik: you can come out of hiding now ;) [13:23:32] hashar: i don't see submodule bumps in mediawiki [13:23:46] aude I have only CR+2 ed the mw ones [13:23:51] https://phabricator.wikimedia.org/diffusion/MW/history/wmf%252F1.28.0-wmf.18/ [13:23:51] will do the deploy on mw1099 soonish [13:23:59] mine got merged [13:24:01] phuedx: got some errand scap lock failure for whatever reason [13:24:05] * yurik starts breathing again [13:24:06] I have the scap lock. [13:24:12] wtf ? 
[13:24:20] https://gerrit.wikimedia.org/r/#/c/311453/ [13:24:27] * yurik goes back into hiding [13:24:51] https://phabricator.wikimedia.org/diffusion/MW/browse/wmf%252F1.28.0-wmf.18/.gitmodules is tracking wmf/1.28.0-wmf.18 "Wikidata" [13:24:53] * yurik wonders if "format /" works on linux [13:25:02] Krenair: release the lock please [13:25:13] I'm still trying to find the error you said came up on my patch [13:25:13] * aude could submit manual submodule bumps if needed [13:25:27] Krenair: I told you i will revisit your patch after [13:26:25] Looks like they went through hhvm.log, though I'm still not sure why [13:26:28] phuedx: well got it on mw1099 [13:26:43] hashar: ? [13:27:18] (03PS1) 10Alex Monk: Revert "Revert "New wikitext editor: Enable the Beta Feature in Beta Cluster"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311983 [13:28:14] aude your wikidata bump for wmf.18 is on mw1099 [13:28:21] ok :) [13:28:23] * aude checks [13:28:53] The scap lock is released [13:29:12] looks ok [13:29:24] yurik: and I CR+2 your wmf.18 [13:29:48] going to scap wmf.19 and wmf.20 [13:30:01] thx! [13:30:04] RECOVERY - Host elastic1027 is UP: PING OK - Packet loss = 0%, RTA = 2.21 ms [13:30:11] 06Operations, 10ops-eqiad, 06DC-Ops, 06Discovery-Search: elastic1027 does not reboot - https://phabricator.wikimedia.org/T146268#2655830 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson Appears that the idrac was hung. Powered off, drained flea power and powered up and booted to OS. [13:31:44] !log hashar@tin Synchronized php-1.28.0-wmf.19/extensions/Kartographer/: (no message) (duration: 00m 50s) [13:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:32:37] Krenair: ok back to your. So basically I screwed it up by running scap sync-dir wmf-config when the order matter isn't it ? [13:32:45] yes [13:32:52] :( [13:32:54] I am such a noob [13:32:57] hashar, is 18 done? 
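[editor's note] The ordering problem discussed above (CommonSettings reaching servers before the InitialiseSettings it reads from) can be illustrated with a small sketch. This is only an assumption-laden illustration, not the deployers' tooling: `order_for_sync`, `SYNC_PRIORITY`, and the "everything else goes in the middle" rule are hypothetical, and the code only orders file names; it does not invoke scap.

```python
from pathlib import PurePosixPath

# Hypothetical priority table: lower numbers are synced first.
# Settings definitions go out before the file that reads them.
SYNC_PRIORITY = {
    "InitialiseSettings.php": 0,
    "InitialiseSettings-labs.php": 0,
    "CommonSettings.php": 2,
}
# Assumption: any other config file is placed between the two groups.
DEFAULT_PRIORITY = 1


def order_for_sync(paths):
    """Return paths ordered so InitialiseSettings files precede CommonSettings."""
    def key(path):
        return SYNC_PRIORITY.get(PurePosixPath(path).name, DEFAULT_PRIORITY)

    # sorted() is stable, so files with equal priority keep their input order.
    return sorted(paths, key=key)


changed = [
    "wmf-config/CommonSettings.php",
    "wmf-config/InitialiseSettings.php",
    "wmf-config/InitialiseSettings-labs.php",
]

for path in order_for_sync(changed):
    print(path)
```

Syncing the files one at a time in that order narrows, but may not fully close, the window in which a server holds the new CommonSettings alongside an old InitialiseSettings.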
[13:33:25] yurik: it is on mw1099 for phuedx :) [13:33:37] or I am confused [13:33:43] hashar, that's a different patch :) [13:33:52] i claim no ownership over that one :D [13:34:00] the kartographer one yeah should be on mw1099 [13:34:01] even though i know exactly what's its for [13:34:05] ah, ok [13:34:08] 06Operations: Build poolcounter for jessie - https://phabricator.wikimedia.org/T146277#2655845 (10MoritzMuehlenhoff) [13:34:16] !log hashar@tin Synchronized php-1.28.0-wmf.19/extensions/Wikidata: (no message) (duration: 02m 22s) [13:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:34:22] yurik: camaraderie! [13:34:24] :D [13:34:36] hashar, seems good with kartographer :) [13:34:40] hehe [13:34:46] yurik: ah no. I was waiting for CI to finish :) [13:35:09] Kartographer patch for wmf.18 is now on mw1099 [13:35:12] hehe :) ok, next time i will do a more extensive test :) [13:35:19] testing.. [13:35:59] hashar: scap-dir isn't atomic enough, so some requests were served with new CS, old IS, the logs are polluted 50-120 minutes afterwards with the notices. [13:36:03] hashar, all good [13:36:07] sync-dir [13:37:06] 06Operations, 10ops-eqiad: mw1172 stuck after reboot - https://phabricator.wikimedia.org/T146263#2655880 (10Cmjohnson) 05Open>03Resolved drained flea power...server booted normally. [13:37:13] !log adding planet_osm_lines and roads indexes on maps* [13:37:16] hashar: https://phabricator.wikimedia.org/T141913 was filled about that, and good news, scap3 will solve the issue [13:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:37:21] !log hashar@tin Synchronized php-1.28.0-wmf.18/extensions/Kartographer: For yurik or phuedx? 
:D (duration: 00m 48s) [13:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:37:27] hashar: 1099 seems fine with the config change (going through enwiki) [13:37:29] (03PS1) 10Hashar: New wikitext editor: Enable the Beta Feature in Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311990 [13:37:35] lol [13:37:42] hashar: kartographer is for yurik [13:37:46] config change for zero is mine [13:38:17] (03CR) 10Hashar: [C: 032] "Had it reverted because I screwed up the sync order :(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311990 (owner: 10Hashar) [13:38:52] (03Merged) 10jenkins-bot: New wikitext editor: Enable the Beta Feature in Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311990 (owner: 10Hashar) [13:39:05] phuedx: syncing [13:39:49] !log hashar@tin Synchronized wmf-config/mobile.php: For phuedx or is that for yurik? (duration: 00m 47s) [13:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:40:43] throws some Belgium chocolate at hashar... I heard it helps with memory and other cognitive abilities :-P [13:41:01] Dereckson: so what ends up to be the proper sync order? :} [13:41:12] PROBLEM - puppet last run on mw1293 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:41:23] hashar, stop wikipedia, deploy all, restart :-P [13:41:45] hashar: one the two IS, then the CS [13:41:58] (insert harddrive reformats and cache flushes, and machine reboots in the middle) [13:42:35] one day I will rename IS and CS [13:42:40] I find the names misleading [13:43:09] Dereckson: thanks [13:43:16] Krenair: sorry to have freaked out. 
Processing [13:43:19] !log hashar@tin Synchronized wmf-config/InitialiseSettings-labs.php: (no message) (duration: 00m 48s) [13:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:44:12] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 46s) [13:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:44:53] (03CR) 10Filippo Giunchedi: [C: 031] Monitor usage of in-memory elasticsearch datastructures [puppet] - 10https://gerrit.wikimedia.org/r/311848 (https://phabricator.wikimedia.org/T144387) (owner: 10EBernhardson) [13:45:08] !log hashar@tin Synchronized wmf-config/CommonSettings.php: (no message) (duration: 00m 46s) [13:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:45:28] it still emitted a bunch of notices on a few servers [13:45:30] but not much [13:45:44] so all is complete [13:46:21] hm, it did temporarily [13:46:23] strange [13:46:40] regardless I think it is complete now :} [13:46:45] either way I think PHP defaults to the value we want when that happens, so it's ok [13:46:46] Krenair: sorry for the freak out [13:47:05] that's ok [13:48:22] !log European SWAT completed [13:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:52:10] (03PS1) 10BBlack: text frontend VCL: copy 4-hit-wonder from upload [puppet] - 10https://gerrit.wikimedia.org/r/311994 [13:56:21] (03CR) 10Filippo Giunchedi: "good idea! 
reported as #838490" [puppet] - 10https://gerrit.wikimedia.org/r/311970 (https://phabricator.wikimedia.org/T144778) (owner: 10Filippo Giunchedi) [13:58:21] (03CR) 10Alexandros Kosiaris: [C: 031] Switch codfw and ulsfo to puppetmaster2001/puppetdb [dns] - 10https://gerrit.wikimedia.org/r/311435 (owner: 10Giuseppe Lavagetto) [13:58:25] (03PS1) 10BBlack: upload storage: transition cp1048+cp1049 [puppet] - 10https://gerrit.wikimedia.org/r/311996 [13:58:27] (03PS1) 10BBlack: upload storage: transition cp1050+cp1062 [puppet] - 10https://gerrit.wikimedia.org/r/311997 [13:58:29] (03PS1) 10BBlack: upload storage: transition cp1063+cp1064 [puppet] - 10https://gerrit.wikimedia.org/r/311998 [13:58:31] (03PS1) 10BBlack: upload storage: transition cp1071+cp1072 [puppet] - 10https://gerrit.wikimedia.org/r/311999 [13:58:33] (03PS1) 10BBlack: upload storage: finish up eqiad (cp1073+cp1074) [puppet] - 10https://gerrit.wikimedia.org/r/312000 [13:58:35] (03PS2) 10Giuseppe Lavagetto: Switch codfw and ulsfo to puppetmaster2001/puppetdb [dns] - 10https://gerrit.wikimedia.org/r/311435 [13:58:44] 06Operations, 06Labs, 10Labs-Infrastructure, 07Wikimedia-Incident: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2655948 (10AlexMonk-WMF) >>! In T146212#2655916, @AlexMonk-WMF wrote: > We looked into it last night, but weren't able to find the cause. We... 
[13:58:59] !log disabled puppet on neon, puppet migration in progress [13:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:59:05] (03CR) 10Giuseppe Lavagetto: [C: 032] Switch codfw and ulsfo to puppetmaster2001/puppetdb [dns] - 10https://gerrit.wikimedia.org/r/311435 (owner: 10Giuseppe Lavagetto) [14:00:17] I assume we should probably halt puppet repo merges for the moment to avoid excess excitement :) [14:00:29] (03CR) 10Giuseppe Lavagetto: [V: 032] Switch codfw and ulsfo to puppetmaster2001/puppetdb [dns] - 10https://gerrit.wikimedia.org/r/311435 (owner: 10Giuseppe Lavagetto) [14:03:26] <_joe_> bblack: dunno, it should be ok, but maybe wait a few minutes? [14:06:52] RECOVERY - puppet last run on mw1293 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:09:02] PROBLEM - mediawiki-installation DSH group on mw1172 is CRITICAL: Host mw1172 is not in mediawiki-installation dsh group [14:10:29] <_joe_> why is ^^ ? [14:10:38] <_joe_> a reimaging still ongoing? [14:11:18] no, that's harmless [14:11:43] that host went down earlier in the day and Chris fixed it, so it's back up [14:11:54] <_joe_> ok :) [14:11:55] <_joe_> thanks [14:11:57] and the icinga acknowledgement expired [14:12:11] _joe_: conftool changes ok ATM?
[14:12:28] (for repooling, otherwise I'll just silence it) [14:12:39] <_joe_> yes [14:14:50] (03PS1) 10Ema: run-no-puppet: do not use brackets in disable message [puppet] - 10https://gerrit.wikimedia.org/r/312004 [14:15:09] (03PS1) 10Giuseppe Lavagetto: role::prometheus: correct puppetdb template [puppet] - 10https://gerrit.wikimedia.org/r/312005 [14:15:26] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] role::prometheus: correct puppetdb template [puppet] - 10https://gerrit.wikimedia.org/r/312005 (owner: 10Giuseppe Lavagetto) [14:21:10] (03PS2) 10Ema: run-no-puppet: do not interpret grep pattern as a regex [puppet] - 10https://gerrit.wikimedia.org/r/312004 [14:25:08] (03PS1) 10Giuseppe Lavagetto: role::prometheus: use the hostname, not the fqdn [puppet] - 10https://gerrit.wikimedia.org/r/312006 [14:25:35] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] role::prometheus: use the hostname, not the fqdn [puppet] - 10https://gerrit.wikimedia.org/r/312006 (owner: 10Giuseppe Lavagetto) [14:30:33] (03PS1) 10Alexandros Kosiaris: Switch eqiad, esams and wikimedia.org puppetmaster2001/puppetdb [dns] - 10https://gerrit.wikimedia.org/r/312007 [14:31:12] (03CR) 10Alexandros Kosiaris: [C: 032] Switch eqiad, esams and wikimedia.org puppetmaster2001/puppetdb [dns] - 10https://gerrit.wikimedia.org/r/312007 (owner: 10Alexandros Kosiaris) [14:37:04] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 03Fundraising Sprint Rocket Surgery 2016: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2656046 (10AndyRussG) [14:37:40] Hi bblack.. Do you have a bit of time to urgently check this out? Maybe a Varnish problem? 
https://phabricator.wikimedia.org/T144952 [14:37:42] (03CR) 10BBlack: [C: 031] run-no-puppet: do not interpret grep pattern as a regex [puppet] - 10https://gerrit.wikimedia.org/r/312004 (owner: 10Ema) [14:40:40] (03PS1) 10Giuseppe Lavagetto: role::puppetmaster: allow searching hostnames across the fleet [puppet] - 10https://gerrit.wikimedia.org/r/312008 [14:40:50] <_joe_> akosiaris: ^^ [14:41:33] 13 +confd::monitor_files: false [14:41:33] 14 +apache::logrotate::period: "daily" [14:41:33] 15 +apache::logrotate::rotate: 7 [14:41:34] ? [14:41:42] by mistake or in purpose ? [14:41:46] _joe_: ^ [14:41:53] https://gerrit.wikimedia.org/r/#/c/312008/1/hieradata/role/common/puppetmaster/frontend.yaml [14:42:08] <_joe_> what is the error? [14:42:16] <_joe_> on purpose [14:42:18] Anyone able to jump on this? https://phabricator.wikimedia.org/T144952 [14:42:24] _joe_: hi!! ^ [14:42:35] <_joe_> we had everything you see there on palladium [14:42:37] banners for centralnotice showing up bad content [14:42:57] <_joe_> AndyRussG: actually I'm not the right person at the moment, I'm in the middle of a migration [14:43:12] _joe_: ah ok np!! :) [14:44:08] _joe_: ok then.. it's just confd::monitor_files: false, apache::logrotate::period: "daily", apache::logrotate::rotate: 7 are added in a seemingly irrelevant change [14:44:10] <_joe_> akosiaris: maybe the confd:: rule might be moved to the more appropriate role [14:44:14] Actually maybe it's a DB or some other race condition issue.. [14:44:26] AaronSchulz: hi! able to look at this? https://phabricator.wikimedia.org/T144952 [14:44:29] Maybe some DB issue? [14:44:29] <_joe_> but I'm just cutting corners a bit now [14:44:37] ok, merging then [14:44:46] (03CR) 10Alexandros Kosiaris: [C: 032] role::puppetmaster: allow searching hostnames across the fleet [puppet] - 10https://gerrit.wikimedia.org/r/312008 (owner: 10Giuseppe Lavagetto) [14:46:13] PROBLEM - puppet last run on labstore1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:47:38] ^looking [14:48:31] chasemp: it's probably the DNS thing already being fixed up [14:48:44] RECOVERY - puppet last run on labstore1005 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [14:48:47] bblack: got it thanks [14:49:47] <_joe_> chasemp: yes, btw that manifest needs fixing [14:50:34] <_joe_> not now though [14:51:18] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 03Fundraising Sprint Rocket Surgery 2016: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2656057 (10AndyRussG) Here's the call that's failing: https://github.com/wikimedia/mediawiki-extensions-CentralNoti... [14:52:33] those are in a bit of a transitional phase, but could you comment on https://phabricator.wikimedia.org/T126083 when you have sec about issues [14:52:38] thanks man [14:53:20] AndyRussG: I assume when you say "bad html", this is a code-level problem with html encoding somewhere in MW now, right? [14:55:14] 06Operations, 10Phabricator (Upstream), 07Upstream: phabricator: can't search for RT tickets (reference field) anymore - https://phabricator.wikimedia.org/T146116#2656075 (10Paladox) [15:00:17] 06Operations: setup wmf4747/wmf4748/wmf4749/wmf4750 for temp kubernetes testing - https://phabricator.wikimedia.org/T146171#2656076 (10Cmjohnson) alll servers are labeled with their asset tags and not enabled. 
[15:03:24] !log installing wireshark security updates [15:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:13:35] (03PS1) 10Elukey: Merge branch 'master' into debian [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/312011 [15:13:37] (03PS1) 10Elukey: Package last upsteam 1.0.1.12-1 [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/312012 [15:14:44] (03CR) 10Alexandros Kosiaris: [C: 032] puppetmaster: servermon report handler [puppet] - 10https://gerrit.wikimedia.org/r/311738 (owner: 10Alexandros Kosiaris) [15:14:49] bblack: sorry, mm yeah looking at it now, it's not Varnish, I don't think. Or at most some race condition in which Varnish might be marginally implicated [15:14:49] (03PS7) 10Alexandros Kosiaris: puppetmaster: servermon report handler [puppet] - 10https://gerrit.wikimedia.org/r/311738 [15:14:51] (03CR) 10Alexandros Kosiaris: [V: 032] puppetmaster: servermon report handler [puppet] - 10https://gerrit.wikimedia.org/r/311738 (owner: 10Alexandros Kosiaris) [15:15:05] bad html is the html content of the banner that is actually an i18n message [15:16:21] (03Abandoned) 10Elukey: Package last upstream 1.0.12-1 [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/311967 (owner: 10Elukey) [15:17:18] 06Operations, 06Release-Engineering-Team, 07Beta-Cluster-reproducible, 07HHVM: Switch mwscript from Zend PHP5 to default php alternative (egHHVM) - https://phabricator.wikimedia.org/T146285#2656106 (10hashar) [15:17:56] 06Operations, 06Release-Engineering-Team, 07Beta-Cluster-reproducible, 07HHVM: Switch mwscript from Zend PHP5 to default php alternative (egHHVM) - https://phabricator.wikimedia.org/T146285#2656124 (10hashar) [15:18:27] 06Operations, 05Goal, 07HHVM: Complete the use of HHVM over Zend PHP on the Wikimedia cluster - https://phabricator.wikimedia.org/T86081#2656132 (10hashar) [15:18:30] 06Operations, 06Release-Engineering-Team, 
07Beta-Cluster-reproducible, 07HHVM: Switch mwscript from Zend PHP5 to default php alternative (egHHVM) - https://phabricator.wikimedia.org/T146285#2656106 (10hashar) [15:20:56] AndyRussG: ok, let me know if I can help with anything (that I actually understand or control) :) [15:21:01] (03PS2) 10Giuseppe Lavagetto: puppetmaster: use puppetdb everywhere, configure accordingly [puppet] - 10https://gerrit.wikimedia.org/r/311436 [15:22:00] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster: use puppetdb everywhere, configure accordingly [puppet] - 10https://gerrit.wikimedia.org/r/311436 (owner: 10Giuseppe Lavagetto) [15:23:12] 06Operations, 06Release-Engineering-Team, 07Beta-Cluster-reproducible, 07HHVM: Switch mwscript from Zend PHP5 to default php alternative (egHHVM) - https://phabricator.wikimedia.org/T146285#2656140 (10EBernhardson) [15:24:27] 06Operations, 06Release-Engineering-Team, 07Beta-Cluster-reproducible: mwscript on jessie mediawiki fails; requires php5-memcached and php5-redis - https://phabricator.wikimedia.org/T146286#2656143 (10thcipriani) [15:25:21] (03PS1) 1020after4: `scap patch` tool for applying patches to a wmf/branch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312013 [15:25:28] 06Operations, 06Release-Engineering-Team, 07Beta-Cluster-reproducible: mwscript on jessie mediawiki fails; requires php5-memcached and php5-redis - https://phabricator.wikimedia.org/T146286#2656160 (10thcipriani) [15:25:31] 06Operations, 06Release-Engineering-Team, 07Beta-Cluster-reproducible, 07HHVM: Switch mwscript from Zend PHP5 to default php alternative (egHHVM) - https://phabricator.wikimedia.org/T146285#2656159 (10thcipriani) [15:25:57] 06Operations, 06Release-Engineering-Team, 07Beta-Cluster-reproducible, 07HHVM: Switch mwscript from Zend PHP5 to default php alternative (egHHVM) - https://phabricator.wikimedia.org/T146285#2656106 (10thcipriani) [15:26:49] (03PS1) 10Alexandros Kosiaris: servermon report handler: Fix typo [puppet] - 
10https://gerrit.wikimedia.org/r/312014 [15:29:09] 06Operations, 06Release-Engineering-Team, 07HHVM, 13Patch-For-Review: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2656179 (10hashar) **status for beta cluster** dpeloyment-mira is the new master running Jessie. The Jenkins jobs are running on it. There are... [15:29:52] 06Operations, 06Release-Engineering-Team, 07Beta-Cluster-reproducible: mwscript on jessie mediawiki fails; requires php5-memcached and php5-redis - https://phabricator.wikimedia.org/T146286#2656143 (10hashar) [15:29:55] 06Operations, 06Release-Engineering-Team, 07HHVM, 13Patch-For-Review: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2656183 (10hashar) [15:31:10] (03CR) 10Elukey: [C: 032 V: 032] Merge branch 'master' into debian [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/312011 (owner: 10Elukey) [15:31:21] 06Operations, 06Release-Engineering-Team, 07Beta-Cluster-reproducible: mwscript on jessie mediawiki fails - https://phabricator.wikimedia.org/T146286#2656188 (10thcipriani) [15:31:24] (03PS1) 10Giuseppe Lavagetto: Revert "Switch eqiad, esams and wikimedia.org puppetmaster2001/puppetdb" [dns] - 10https://gerrit.wikimedia.org/r/312015 [15:31:38] <_joe_> akosiaris: how's the report handler working? [15:31:39] <_joe_> :) [15:32:16] merging a change now [15:32:31] btw.. 
puppet dashboard kills the CPUs on my laptop [15:32:39] (03CR) 10Alexandros Kosiaris: [C: 032] servermon report handler: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/312014 (owner: 10Alexandros Kosiaris) [15:32:43] (03PS2) 10Alexandros Kosiaris: servermon report handler: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/312014 [15:32:45] (03CR) 10Alexandros Kosiaris: [V: 032] servermon report handler: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/312014 (owner: 10Alexandros Kosiaris) [15:32:52] (03PS2) 10BBlack: text frontend VCL: copy 4-hit-wonder from upload [puppet] - 10https://gerrit.wikimedia.org/r/311994 [15:33:04] (03CR) 10BBlack: [C: 032 V: 032] text frontend VCL: copy 4-hit-wonder from upload [puppet] - 10https://gerrit.wikimedia.org/r/311994 (owner: 10BBlack) [15:33:06] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [15:33:56] akosiaris: merged yours [15:34:10] ok, thanks [15:34:52] (03PS32) 1020after4: Scap swat command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) [15:34:54] (03PS1) 1020after4: WIP: `scap scrape` plugin split out from change 306259 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312016 [15:35:10] (03PS33) 1020after4: Scap swat command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) [15:35:12] (03PS2) 1020after4: `scap patch` tool for applying patches to a wmf/branch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312013 [15:35:36] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4312463 keys - replication_delay is 0 [15:37:20] (03CR) 10Giuseppe Lavagetto: [C: 032] Revert "Switch eqiad, esams and wikimedia.org puppetmaster2001/puppetdb" [dns] - 10https://gerrit.wikimedia.org/r/312015 (owner: 10Giuseppe Lavagetto) [15:37:54] (03PS2) 10Hashar: beta: switch 
deploy server to deployment-mira [puppet] - 10https://gerrit.wikimedia.org/r/311947 (https://phabricator.wikimedia.org/T144578) [15:38:07] (03PS3) 10Hashar: beta: switch deploy server to deployment-mira [puppet] - 10https://gerrit.wikimedia.org/r/311947 (https://phabricator.wikimedia.org/T144578) [15:38:08] <_joe_> akosiaris: wanna purge the recursors in eqiad? [15:38:18] doing so now [15:38:57] wipe-cache done [15:39:02] (03CR) 10Muehlenhoff: [C: 031] beta: switch deploy server to deployment-mira [puppet] - 10https://gerrit.wikimedia.org/r/311947 (https://phabricator.wikimedia.org/T144578) (owner: 10Hashar) [15:39:14] requests are flowing in :-) [15:40:35] (03PS4) 10Hashar: beta: switch deploy server to deployment-mira [puppet] - 10https://gerrit.wikimedia.org/r/311947 (https://phabricator.wikimedia.org/T144578) [15:41:36] (03PS34) 1020after4: Scap swat command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) [15:42:43] (03PS2) 1020after4: WIP: `scap scrape` plugin split out from change 306259 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312016 [15:44:10] (03CR) 1020after4: "I've separated out the part that scrapes the deployment page, `scap scrape` is now a separate change with the `scap-plugins` topic" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) (owner: 1020after4) [15:44:13] (03CR) 10Elukey: [C: 032 V: 032] Package last upsteam 1.0.1.12-1 [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/312012 (owner: 10Elukey) [15:44:27] 06Operations, 10ops-eqiad, 06DC-Ops, 06Discovery-Search: elastic1027 does not reboot - https://phabricator.wikimedia.org/T146268#2656257 (10Gehel) @Cmjohnson Thanks! It looks all good from my side as well! [15:52:38] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:56:53] <_joe_> ah right, palladium [15:57:01] <_joe_> fixing in a few [15:59:04] (03CR) 10Thcipriani: [C: 031] beta: switch deploy server to deployment-mira [puppet] - 10https://gerrit.wikimedia.org/r/311947 (https://phabricator.wikimedia.org/T144578) (owner: 10Hashar) [16:06:24] (03PS2) 10BBlack: Remove bits.wikimedia.org from DNS [dns] - 10https://gerrit.wikimedia.org/r/305533 (https://phabricator.wikimedia.org/T107430) [16:08:14] 06Operations, 10Traffic, 13Patch-For-Review: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2656339 (10BBlack) The proposed removal date was 2 days ago, I've just been busy with other things. Will merge removal today unless objections/alternatives as above. Ping @Krinkle . K... [16:12:06] RECOVERY - mediawiki-installation DSH group on mw1172 is OK: OK [16:12:22] (03PS1) 10Elukey: Update Debian specific settings [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/312020 [16:12:33] (03Abandoned) 10Elukey: Update Debian specific settings [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/312020 (owner: 10Elukey) [16:15:38] (03PS1) 10Elukey: Update Debian specific settings [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/312021 [16:18:21] (03PS2) 10Elukey: Update Debian specific settings [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/312021 [16:21:29] (03PS3) 10Elukey: Update Debian specific settings [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/312021 [16:25:54] 06Operations, 06Release-Engineering-Team, 07Beta-Cluster-reproducible: mwscript on jessie mediawiki fails - https://phabricator.wikimedia.org/T146286#2656384 (10hashar) https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11504/ fails due to Flow eventually invoking `curl_multi_init()` Looks... 
[16:26:09] (03PS1) 10Rush: labstore: drbd resource setup sanity [puppet] - 10https://gerrit.wikimedia.org/r/312023 [16:27:33] (03CR) 10jenkins-bot: [V: 04-1] labstore: drbd resource setup sanity [puppet] - 10https://gerrit.wikimedia.org/r/312023 (owner: 10Rush) [16:29:03] (03PS2) 10Rush: labstore: drbd resource setup sanity [puppet] - 10https://gerrit.wikimedia.org/r/312023 [16:29:13] (03PS1) 10Giuseppe Lavagetto: palladium: remove role::puppetmaster::frontend [puppet] - 10https://gerrit.wikimedia.org/r/312024 [16:29:15] (03PS1) 10Giuseppe Lavagetto: palladium: remove role::pybal::config [puppet] - 10https://gerrit.wikimedia.org/r/312025 [16:30:16] !log restbase deploy start of a75510d [16:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:31:40] (03PS1) 10Giuseppe Lavagetto: Remove puppetmaster.test [dns] - 10https://gerrit.wikimedia.org/r/312026 [16:32:50] (03CR) 10Giuseppe Lavagetto: [C: 032] Remove puppetmaster.test [dns] - 10https://gerrit.wikimedia.org/r/312026 (owner: 10Giuseppe Lavagetto) [16:36:13] (03PS4) 10Elukey: Update Debian specific settings [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/312021 [16:37:09] (03CR) 10BBlack: [C: 031] labstore: drbd resource setup sanity [puppet] - 10https://gerrit.wikimedia.org/r/312023 (owner: 10Rush) [16:38:32] (03PS1) 10Giuseppe Lavagetto: hieradata: cleanup palladium/strontium [puppet] - 10https://gerrit.wikimedia.org/r/312029 [16:40:40] (03CR) 10Ema: [C: 031] Update Debian specific settings [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/312021 (owner: 10Elukey) [16:41:15] (03CR) 10Elukey: [C: 032 V: 032] Update Debian specific settings [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/312021 (owner: 10Elukey) [16:41:33] (03CR) 10Giuseppe Lavagetto: [C: 032] palladium: remove role::puppetmaster::frontend [puppet] - 10https://gerrit.wikimedia.org/r/312024 (owner: 10Giuseppe Lavagetto) 
[16:44:34] (03PS1) 10BryanDavis: Add ruby images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/312033 (https://phabricator.wikimedia.org/T141388) [16:45:57] (03PS1) 10Alexandros Kosiaris: puppetmaster: wrap servermon report handler in transactions [puppet] - 10https://gerrit.wikimedia.org/r/312034 [16:46:37] !log restbase deploy end of a75510d [16:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:46:44] !log running P3833 script against designate to clean up existing T120797 mess [16:46:45] T120797: Clean up leaked designate entries - https://phabricator.wikimedia.org/T120797 [16:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:47:15] (03PS2) 10Alexandros Kosiaris: puppetmaster: wrap servermon report handler in transactions [puppet] - 10https://gerrit.wikimedia.org/r/312034 [16:47:18] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] puppetmaster: wrap servermon report handler in transactions [puppet] - 10https://gerrit.wikimedia.org/r/312034 (owner: 10Alexandros Kosiaris) [16:47:36] (03CR) 10Giuseppe Lavagetto: [C: 032] palladium: remove role::pybal::config [puppet] - 10https://gerrit.wikimedia.org/r/312025 (owner: 10Giuseppe Lavagetto) [16:47:42] (03PS2) 10Giuseppe Lavagetto: palladium: remove role::pybal::config [puppet] - 10https://gerrit.wikimedia.org/r/312025 [16:47:49] (03CR) 10Giuseppe Lavagetto: [V: 032] palladium: remove role::pybal::config [puppet] - 10https://gerrit.wikimedia.org/r/312025 (owner: 10Giuseppe Lavagetto) [16:48:07] (03PS2) 10Giuseppe Lavagetto: hieradata: cleanup palladium/strontium [puppet] - 10https://gerrit.wikimedia.org/r/312029 [16:48:23] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] hieradata: cleanup palladium/strontium [puppet] - 10https://gerrit.wikimedia.org/r/312029 (owner: 10Giuseppe Lavagetto) [16:49:16] (03PS3) 10Giuseppe Lavagetto: hieradata: cleanup palladium/strontium [puppet] - 10https://gerrit.wikimedia.org/r/312029 
[16:49:45] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] hieradata: cleanup palladium/strontium [puppet] - 10https://gerrit.wikimedia.org/r/312029 (owner: 10Giuseppe Lavagetto) [16:51:15] PROBLEM - puppetmaster https on palladium is CRITICAL: Connection refused [16:51:36] <_joe_> oh shit [16:51:58] <_joe_> I forgot to run puppet before running my latest puppet-merge [16:52:00] <_joe_> heh [16:53:17] PROBLEM - puppetmaster backend https on palladium is CRITICAL: Connection refused [16:53:56] <_joe_> all of this ^^ is expected, I forgot to downtime the services [16:55:07] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:55:33] 06Operations, 06Labs, 10Labs-Infrastructure, 07Wikimedia-Incident: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2656525 (10AlexMonk-WMF) The script was run against real-labs in T120797 and most existing problem cases should be gone now [17:00:20] (03PS1) 10Yuvipanda: puppet: Enable ENC on trusty modules too [puppet] - 10https://gerrit.wikimedia.org/r/312044 (https://phabricator.wikimedia.org/T91990) [17:08:33] PROBLEM - puppet last run on mw2231 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:10:21] (03PS1) 10Ori.livneh: [WIP] Module for Recommendation API [puppet] - 10https://gerrit.wikimedia.org/r/312045 [17:10:33] PROBLEM - puppet last run on mw1176 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:31:30] (03CR) 10Daniel Kinzler: [C: 031] "Seems fine except for the trailing whitespace. Also, I think we are missing micrometer for some reason."
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311206 (https://phabricator.wikimedia.org/T117032) (owner: 10Smalyshev) [17:33:18] (03CR) 10Smalyshev: "these are filtered by usage, so micrometer is probably not used" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311206 (https://phabricator.wikimedia.org/T117032) (owner: 10Smalyshev) [17:34:43] 06Operations, 10Traffic, 13Patch-For-Review: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2656693 (10BBlack) Same data logging as back on Sep 7, but using Sept 21 data. Not much change in the overall, and still close to the same overall level (~1.46% of all requests): ``` l... [17:34:49] (03CR) 10Smalyshev: "My unit tool (https://www.wikidata.org/wiki/User:Laboramus/Units/P2043) shows only 3 uses of micrometer, right now unit config generator i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311206 (https://phabricator.wikimedia.org/T117032) (owner: 10Smalyshev) [17:35:14] RECOVERY - puppet last run on mw1176 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [17:38:04] PROBLEM - puppet last run on ms-be1022 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 14 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[xfs_label-/dev/sdb3],Exec[mkfs-/dev/sdc1] [17:39:38] (03CR) 10BBlack: [C: 032] Remove bits.wikimedia.org from DNS [dns] - 10https://gerrit.wikimedia.org/r/305533 (https://phabricator.wikimedia.org/T107430) (owner: 10BBlack) [17:39:43] (03PS3) 10BBlack: Remove bits.wikimedia.org from DNS [dns] - 10https://gerrit.wikimedia.org/r/305533 (https://phabricator.wikimedia.org/T107430) [17:39:45] (03CR) 10BBlack: [V: 032] Remove bits.wikimedia.org from DNS [dns] - 10https://gerrit.wikimedia.org/r/305533 (https://phabricator.wikimedia.org/T107430) (owner: 10BBlack) [17:40:29] !log installed varnishkafka 1.0.12-1 on cp3034.esams (T138747) [17:40:57] T138747: Varnishkafka should auto-reconnect to abandoned VSM - https://phabricator.wikimedia.org/T138747 [17:41:02] (03PS2) 10Yuvipanda: puppet: Enable ENC on trusty nodes too [puppet] - 10https://gerrit.wikimedia.org/r/312044 (https://phabricator.wikimedia.org/T91990) [17:41:09] !log bits.wikimedia.org hostname removed from DNS (if related real complaints/problems occur, revert https://gerrit.wikimedia.org/r/305533 ) [17:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:41:24] bblack: nice! \o/ [17:42:16] I'm somewhat apprehensive that there will be some kind of complaint, but at this point I don't know a better way to shake out those complaints and find what to fix. We've put a lot effort into tracking down and killing refs to it over the past several months [17:45:19] 06Operations, 10Mail, 10Phabricator: Phabricator emails failing spf - https://phabricator.wikimedia.org/T146299#2656753 (10Krenair) Something strange is going on here, I got an SPF pass on that message to my @gmail.com address - why does google use the origin server's IP for the SPF check instead of the rela... 
[17:49:20] Who would be responsible for the part of infrastructure that serves i18n messages? [17:49:44] Just for reference (about to lose connectivity): https://phabricator.wikimedia.org/T144952 [17:59:40] (03PS1) 10Alexandros Kosiaris: Disable puppetDB everywhere [puppet] - 10https://gerrit.wikimedia.org/r/312050 [18:00:04] anomie, ostriches, thcipriani, hashar, and twentyafterfour: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160921T1800). [18:00:42] (03CR) 10Yuvipanda: [C: 032] puppet: Enable ENC on trusty nodes too [puppet] - 10https://gerrit.wikimedia.org/r/312044 (https://phabricator.wikimedia.org/T91990) (owner: 10Yuvipanda) [18:00:47] nothing posted for SWAT. [18:02:32] hm, so what is the status of wmf.19 and 20? or is it 21 already? [18:03:57] MatmaRex: current plan is outlined here: https://phabricator.wikimedia.org/T144644#2654240 [18:04:12] going to try to get wmf.20 to group0 wikis during the upcoming deploy window. [18:04:22] thanks [18:14:36] (03CR) 10Madhuvishy: [C: 031] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/312023 (owner: 10Rush) [18:16:51] brion: so the plan with the deployment train is to move forward with wmf.20 to group0 during the next deployment window. Looked through the SWATs that have happened since cutting wmf.20 and now and I think I need to backport this patch to wmf.20: https://gerrit.wikimedia.org/r/#/c/311852/ does that look correct to you? [18:18:31] aude: I saw you merged something during the European SWAT window to wikidata wmf.18, but I remember you saying something about wikidata and wmf.20. Is wmf.20 fine with wikidata wmf.18? [18:18:59] (03PS1) 10Alexandros Kosiaris: puppetmaster/puppetdb: Make ferm rules better [puppet] - 10https://gerrit.wikimedia.org/r/312054 [18:19:54] thcipriani: aude is probably on a plane [18:20:37] Lydia_WMDE: oh! thank you for letting me know.
[18:20:43] np [18:22:06] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster/puppetdb: Make ferm rules better [puppet] - 10https://gerrit.wikimedia.org/r/312054 (owner: 10Alexandros Kosiaris) [18:22:49] Lydia_WMDE: are you aware of any compatibility issues between wikidata wmf.18 and mediawiki (and other extensions) wmf.20? I only have a vague memory of aude mentioning something (it may already be taken care of) [18:23:18] thcipriani: unfortunately not. i am lacking a lot of info this week unfortunately [18:23:36] ok, no problem. [18:24:12] (03PS1) 10BBlack: phab SPF: add iridium private IPs [dns] - 10https://gerrit.wikimedia.org/r/312058 (https://phabricator.wikimedia.org/T146299) [18:24:31] thcipriani Hi, i backported it here https://gerrit.wikimedia.org/r/#/c/312055/ for tmh [18:25:27] (03CR) 10Alex Monk: [C: 031] phab SPF: add iridium private IPs [dns] - 10https://gerrit.wikimedia.org/r/312058 (https://phabricator.wikimedia.org/T146299) (owner: 10BBlack) [18:25:44] (03CR) 10Paladox: [C: 031] phab SPF: add iridium private IPs [dns] - 10https://gerrit.wikimedia.org/r/312058 (https://phabricator.wikimedia.org/T146299) (owner: 10BBlack) [18:26:32] yuvipanda: got your ping about limn1, thank you. [18:27:38] yuvipanda: i really would like us to kill the instance now and see if anyone complains, i doubt there are actively looked at things there that are not much outdated. [18:28:01] nuria_: sure. we can shut it off now, and delete it in a month or something [18:28:32] yuvipanda: I know i brought this up before with team and someone had a good objection .. but for my life i cannot remember now [18:28:42] :) [18:31:31] yuvipanda: created task on kanban: https://phabricator.wikimedia.org/T146308 [18:31:49] yuvipanda: will add it to operational excellence for next quarter cc milimetric [18:32:36] ok! 
[18:34:04] (03CR) 10BBlack: [C: 032] phab SPF: add iridium private IPs [dns] - 10https://gerrit.wikimedia.org/r/312058 (https://phabricator.wikimedia.org/T146299) (owner: 10BBlack) [18:39:20] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 03Fundraising Sprint Rocket Surgery 2016: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2615300 (10awight) a:03awight [18:51:33] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack/Setup pay-lvs1003 and pay-lvs1004 - https://phabricator.wikimedia.org/T143900#2657030 (10Jgreen) [19:00:04] thcipriani: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160921T1900). [19:01:58] ok, I'm going to merge the backport to TimedMediaHandler (thanks paladox ) as I think that's the only backport that is in wmf.18/19 and not in wmf.20, and then I'm going to start the l10n rebuild for wmf.20. [19:02:16] You're welcome :) [19:03:19] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack/Setup pay-lvs1003 and pay-lvs1004 - https://phabricator.wikimedia.org/T143900#2657079 (10Jgreen) @Cmjohnson could you take a look at BIOS/ILOM settings for pay-lvs1004? I tried to pxeboot-image it and the process kept croaking immediately after fetching... [19:03:32] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack/Setup pay-lvs1003 and pay-lvs1004 - https://phabricator.wikimedia.org/T143900#2657080 (10Jgreen) p:05Triage>03Normal [19:06:24] PROBLEM - puppet last run on mw1248 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:06:33] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:07:54] thcipriani: Once done I'll deploy https://gerrit.wikimedia.org/r/312065 [19:09:05] Krinkle: still waiting on jenkins if you want to get that out now. If not, I can ping you when scap is done.
[19:09:21] OK. I'll push it now :) [19:12:13] (03PS3) 10Andrew Bogott: openstack: Import nova_ldap designate plugin [puppet] - 10https://gerrit.wikimedia.org/r/308875 (https://phabricator.wikimedia.org/T144317) (owner: 10Alex Monk) [19:20:22] 06Operations, 06Operations-Software-Development: Evaluation of automation/orchestration tools - https://phabricator.wikimedia.org/T143306#2657113 (10Matanya) [19:20:35] (03CR) 10Andrew Bogott: [C: 032] openstack: Import nova_ldap designate plugin [puppet] - 10https://gerrit.wikimedia.org/r/308875 (https://phabricator.wikimedia.org/T144317) (owner: 10Alex Monk) [19:20:38] 06Operations, 06Operations-Software-Development: Evaluation of automation/orchestration tools - https://phabricator.wikimedia.org/T143306#2563658 (10Matanya) [19:22:28] Syncing [19:23:22] (03PS1) 10Hashar: rpc: trick mw into generating a raw exception report [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312077 [19:24:07] !log krinkle@tin Synchronized php-1.28.0-wmf.18/resources/src/mediawiki/mediawiki.js: T146099 (duration: 01m 41s) [19:24:07] T146099: mw-1.28.0-wmf.18 load-time regression - https://phabricator.wikimedia.org/T146099 [19:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:26:23] (03CR) 10Hashar: "Aaron that one is pretty lame but MWExceptionHandler::handleException() is hardcoded to use the prettified renderer :(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312077 (owner: 10Hashar) [19:30:57] Krinkle: clear for full scap? 
[19:31:03] RECOVERY - puppet last run on mw1248 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [19:31:06] thcipriani: yes [19:32:19] !log thcipriani@tin Started scap: testwiki to php-1.28.0-wmf.20 and rebuild l10n cache [19:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:32:27] cool, thank you :) [19:33:32] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:36:12] (03PS2) 10BBlack: upload storage: transition cp1048+cp1049 [puppet] - 10https://gerrit.wikimedia.org/r/311996 [19:36:33] thcipriani: I'm trying to slip a CentralNotice change into the train deployment... Is this a good time to do so? [19:36:59] awight: what change? [19:37:19] It's tiny: //gerrit.wikimedia.org/r/312074 [19:37:23] aach https://gerrit.wikimedia.org/r/312074 [19:37:56] (03CR) 10BBlack: [C: 032] upload storage: transition cp1048+cp1049 [puppet] - 10https://gerrit.wikimedia.org/r/311996 (owner: 10BBlack) [19:38:22] awight: phew, ok, was worried it might have l10n stuff. You want this on the branch going to group0? wmf.20? The goal is to have that on all wikis by end of week? [19:38:36] er s/\?$/./ [19:38:53] yah actually it's unbreak now for us, so I might try to do a lightning deployment to wmf.19 while we're waiting... [19:39:01] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: access request for 'researchers' and 'analytics-users' for Deborah Tankersley - https://phabricator.wikimedia.org/T145914#2657165 (10Dzahn) [19:39:25] Cool, I'm reading the 1.28 roadmap now, yeah group 1 won't quite be enuf [19:39:42] wmf.19 isn't on any wikis, we're jumping over that one straight to wmf.20. 
[19:39:50] ooh, thx [19:40:00] (03PS1) 10Dzahn: admin: add user debt to researchers and analytics-users [puppet] - 10https://gerrit.wikimedia.org/r/312083 (https://phabricator.wikimedia.org/T145914) [19:40:11] So, I think I'll just schedule a deployment for 21:00 UTC, once you're done with the train. [19:40:18] greg-g: ^ with your blessings [19:40:26] awight: kk, just started scap, might be a bit before that's complete. Can we backport this change to both wmf.18 and wmf.20 and I can get it out? OK, or that :) [19:40:44] I'll make the patches and then see where u're at [19:40:59] No problem waiting the hour and doing it myself [19:41:09] (03PS4) 10Andrew Bogott: openstack: Import nova_ldap designate plugin [puppet] - 10https://gerrit.wikimedia.org/r/308875 (https://phabricator.wikimedia.org/T144317) (owner: 10Alex Monk) [19:41:32] awight: sounds good. I'll ping you when scap completes. Rebuilding l10n takes...a while (30–45 minutes). [19:41:44] well, rebuild, sync, the whole thing. [19:41:51] it's not so bad these days! [19:47:02] 06Operations, 10ops-eqiad: dbstore1001: check drive bays - https://phabricator.wikimedia.org/T145389#2657204 (10RobH) [20:00:05] gwicke, cscott, arlolra, subbu, bearND, mdholloway, halfak, and Amir1: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160921T2000). Please do the needful. [20:00:18] no ores for today [20:02:45] 06Operations, 10Mail, 10Phabricator, 13Patch-For-Review: Phabricator emails failing spf - https://phabricator.wikimedia.org/T146299#2656694 (10BBlack) ping [20:03:41] 06Operations, 10Mail, 10Phabricator, 13Patch-For-Review: Phabricator emails failing spf - https://phabricator.wikimedia.org/T146299#2657255 (10BBlack) 05Open>03Resolved a:03BBlack The email sent from my ping nows says `Received-SPF: pass (google.com: domain of no-reply@phabricator.wikimedia.org desig... 
[20:04:51] !log starting Parsoid deploy [20:05:21] (03CR) 10Andrew Bogott: [C: 032] Follow-up I026e4f57: Fix style [puppet] - 10https://gerrit.wikimedia.org/r/308876 (owner: 10Alex Monk) [20:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:05:21] (03PS3) 10Andrew Bogott: Follow-up I026e4f57: Fix style [puppet] - 10https://gerrit.wikimedia.org/r/308876 (owner: 10Alex Monk) [20:05:45] (03PS2) 10Dzahn: admin: add user debt to researchers and analytics-users [puppet] - 10https://gerrit.wikimedia.org/r/312083 (https://phabricator.wikimedia.org/T145914) [20:05:49] awight|mtg: blessed [20:07:22] (03CR) 10Lydia Pintscher: [C: 031] "> (1 comment)" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311206 (https://phabricator.wikimedia.org/T117032) (owner: 10Smalyshev) [20:09:34] (03PS4) 10Smalyshev: Add config for units on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311206 (https://phabricator.wikimedia.org/T117032) [20:11:32] (03CR) 10Dzahn: [C: 032] admin: add user debt to researchers and analytics-users [puppet] - 10https://gerrit.wikimedia.org/r/312083 (https://phabricator.wikimedia.org/T145914) (owner: 10Dzahn) [20:11:41] (03PS3) 10Dzahn: admin: add user debt to researchers and analytics-users [puppet] - 10https://gerrit.wikimedia.org/r/312083 (https://phabricator.wikimedia.org/T145914) [20:15:20] (03CR) 10Daniel Kinzler: Add config for units on Wikidata (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311206 (https://phabricator.wikimedia.org/T117032) (owner: 10Smalyshev) [20:17:19] !log updated Parsoid to version a802de0 [20:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:18:29] (03PS5) 10Smalyshev: Add config for units on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311206 (https://phabricator.wikimedia.org/T117032) [20:19:37] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: access request for 
'researchers' and 'analytics-users' for Deborah Tankersley - https://phabricator.wikimedia.org/T145914#2657299 (10Dzahn) @mpopob @debt Alright, thanks. I added you to the researchers and the analytics-users groups. Your user has b... [20:19:58] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: access request for 'researchers' and 'analytics-users' for Deborah Tankersley - https://phabricator.wikimedia.org/T145914#2657300 (10Dzahn) 05Open>03Resolved a:03Dzahn [20:20:48] (03PS2) 10Dzahn: varnish/htcppurger: don't use ensure => latest [puppet] - 10https://gerrit.wikimedia.org/r/310895 (https://phabricator.wikimedia.org/T115348) [20:21:02] (03CR) 10Dzahn: [C: 032] varnish/htcppurger: don't use ensure => latest [puppet] - 10https://gerrit.wikimedia.org/r/310895 (https://phabricator.wikimedia.org/T115348) (owner: 10Dzahn) [20:23:57] thcipriani, is it ok for me to depl kartotherian scap3 service [20:24:21] !log thcipriani@tin Finished scap: testwiki to php-1.28.0-wmf.20 and rebuild l10n cache (duration: 52m 01s) [20:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:24:43] yurik: should be fine: different lock files different deploy methods. 
[20:24:49] cool [20:26:04] (03PS2) 10Dzahn: remove titanium.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/310599 (https://phabricator.wikimedia.org/T145666) [20:26:39] (03PS3) 10Andrew Bogott: shinkengen: Remove use of puppetVars [puppet] - 10https://gerrit.wikimedia.org/r/309008 (owner: 10Alex Monk) [20:27:59] (03CR) 10Dzahn: [C: 032] "this host has been shutdown last week" [dns] - 10https://gerrit.wikimedia.org/r/310599 (https://phabricator.wikimedia.org/T145666) (owner: 10Dzahn) [20:29:41] (03CR) 10Andrew Bogott: [C: 032] shinkengen: Remove use of puppetVars [puppet] - 10https://gerrit.wikimedia.org/r/309008 (owner: 10Alex Monk) [20:29:46] (03CR) 10Dzahn: [C: 031] ganglia: ship native systemd service unit [puppet] - 10https://gerrit.wikimedia.org/r/311970 (https://phabricator.wikimedia.org/T144778) (owner: 10Filippo Giunchedi) [20:30:41] (03PS1) 10Thcipriani: Group0 to 1.28.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312104 [20:31:19] 06Operations, 10ops-eqiad, 10Analytics-Cluster: decom titanium - https://phabricator.wikimedia.org/T145666#2657378 (10Dzahn) [20:31:43] !log starting mobileapps deploy [20:31:45] (03CR) 10Thcipriani: [C: 032] Group0 to 1.28.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312104 (owner: 10Thcipriani) [20:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:31:52] thcipriani: hold on that one please [20:31:59] idoine: kk [20:32:02] wanna verify we can still create an account on testwiki :D [20:32:11] (03Merged) 10jenkins-bot: Group0 to 1.28.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312104 (owner: 10Thcipriani) [20:32:43] bah bytecode cache is not primed :( [20:32:45] that is slow [20:32:58] wgBackendResponseTime":13460 :D [20:33:13] yeap [20:33:25] got an account created on testwiki :D [20:34:01] me too! 
:) [20:34:04] thcipriani: that was just a smoke test :] [20:34:06] good to me [20:34:07] ok, going out to group0 [20:34:29] that is where we should do what zeljkof proposed: run smoke tests against testwiki [20:35:05] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: access request for 'researchers' and 'analytics-users' for Deborah Tankersley - https://phabricator.wikimedia.org/T145914#2657413 (10debt) Thanks, @Dzahn! [20:35:12] thcipriani: scap deploy asks me this: "canary deploy successful. Continue? [y]es/[n]o/[c]ontinue all groups". I'm not sure what the difference between 'y' and 'c' answers is. [20:35:23] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.28.0-wmf.20 [20:35:36] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:36:01] bearND: so if you answer 'c' it will not prompt you to roll forward again during the current deployment, no matter how many groups you have defined [20:36:15] if you only have a canary group and a default group there is no difference between y and c [20:36:19] !log deployed kartotherian geoshape lines support - https://gerrit.wikimedia.org/r/#/c/312097/ [20:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:36:30] thcipriani, i'm done [20:36:40] thcipriani: ok. thanks. 
So, 'y' seems like the safe bet [20:36:45] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, 13Patch-For-Review: CN: Stop using the geoiplookup HTTPS service (always use the Cookie) - https://phabricator.wikimedia.org/T143271#2657416 (10DStrine) [20:36:58] bearND: yeah, 'y' only has the potential to be annoying, not harmful :) [20:37:05] annoying for the deployer [20:37:16] yurik: cool :) [20:39:46] !log deployed mobileapps bf6943b [20:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:40:07] train is complete and verified wmf.20 on group0, wmf.18 everywhere else [20:40:14] PROBLEM - Varnishkafka log producer on cp3046 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [20:40:19] awight|mtg: ^ is all clear for you to deploy backports [20:40:31] thcipriani: how's the graphs and such? [20:40:40] you know, "metrics" [20:40:55] (ignore my odd state of mind) [20:41:06] :) [20:41:07] Metric 1 Green Metric 2 Green Metric 3 Green Metric 4 (stretch goal) orange [20:41:15] yup [20:41:50] 06Operations, 10Ops-Access-Requests: Access to people.wikimedia.org for Volker_E - https://phabricator.wikimedia.org/T143465#2657444 (10Dzahn) 05Open>03stalled [20:42:02] heh, logs look good. Probably too small of a change to be very impactful anywhere else. I'm looking through grafana and all seems ok. [20:42:48] looking at group0 fatalmonitor there doesn't seem to be any major shift in error rate since the deployment. [20:43:18] I still need to update the roadmap :\ [20:46:24] 06Operations, 10Ops-Access-Requests: Access to people.wikimedia.org for Volker_E - https://phabricator.wikimedia.org/T143465#2657452 (10Dzahn) I have pinged multiple times but didn't get a response. I suggest we close the ticket at the end end of the week and reopen it when/if actually needed. There is also st... 
[20:47:14] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Request for access to stat1003 for Sam Walton - https://phabricator.wikimedia.org/T145788#2640942 (10Ocaasi) @Dzahn You have my approval as Sam's manager. He is a contractor for The Wikipedia Library program in Community Engagement, which I run. [20:50:10] (03PS1) 10Alex Monk: shinkengen: get role classes from puppet enc too [puppet] - 10https://gerrit.wikimedia.org/r/312109 [20:52:33] RECOVERY - Varnishkafka log producer on cp3046 is OK: PROCS OK: 1 process with command name varnishkafka [20:56:55] (03CR) 10Jhobs: [C: 031] "I don't //think// you need to do the whole wmg->wg conversion anymore (can just do wg in InitialiseSettings), but it also doesn't hurt any" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311197 (https://phabricator.wikimedia.org/T144912) (owner: 10Bmansurov) [20:59:02] thcipriani: if it didn't make the branch point yeah [20:59:32] brion: yup, it got backported to wmf.20 pre this previous deploy. [20:59:38] ah great :D [20:59:39] thx! [20:59:40] (03CR) 10Daniel Kinzler: [C: 031] "This *looks* like what we want, but I can't vouch for it actually working. I also don't have +2 here, so +1 is all i can do :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311206 (https://phabricator.wikimedia.org/T117032) (owner: 10Smalyshev) [21:00:06] pre the previous [21:00:29] I stand by that statement [21:00:57] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [21:08:00] thcipriani: we aim at wmf.20 tomorrow right ? [21:08:12] will check with wikidata folks tomorrow [21:08:33] idoine: that is the plan, shortened train tomorrow. 
[21:09:55] sounds good [21:10:20] maybe later we can consider moving wikidata to its own group [21:10:25] during european business day [21:16:55] sleep well all [21:17:07] * greg-g waves [21:20:59] (03PS1) 10Alex Monk: designate-sink nova_ldap: set l to the correct site [puppet] - 10https://gerrit.wikimedia.org/r/312115 [21:30:24] 06Operations, 10Mail, 10Phabricator, 13Patch-For-Review: Phabricator emails failing spf - https://phabricator.wikimedia.org/T146299#2656694 (10greg) I bet this was a dupe of {T144381} [21:32:57] (03PS4) 10Dzahn: admin: create shell account for Sam Walton [puppet] - 10https://gerrit.wikimedia.org/r/311473 (https://phabricator.wikimedia.org/T145788) [21:37:31] (03CR) 10Dzahn: [C: 032] "This was just pending manager approval, which it got now." [puppet] - 10https://gerrit.wikimedia.org/r/311473 (https://phabricator.wikimedia.org/T145788) (owner: 10Dzahn) [21:41:20] (03PS3) 10Dzahn: admin: add samwalton9 to researchers [puppet] - 10https://gerrit.wikimedia.org/r/311480 (https://phabricator.wikimedia.org/T145788) [21:43:20] (03CR) 10Dzahn: [C: 032] admin: add samwalton9 to researchers [puppet] - 10https://gerrit.wikimedia.org/r/311480 (https://phabricator.wikimedia.org/T145788) (owner: 10Dzahn) [21:44:52] 06Operations, 10Ops-Access-Requests: Access to people.wikimedia.org for Volker_E - https://phabricator.wikimedia.org/T143465#2657650 (10Volker_E) @Dzahn, I've replied to you directly via IRC?! Reason was, I haven't had capacity to care about that until now. Anyways, I just signed the Acknowledgement. cc: @Ari...
[21:46:24] Operations, Ops-Access-Requests: Access to people.wikimedia.org for Volker_E - https://phabricator.wikimedia.org/T143465#2657654 (Paladox) stalled>Open
[21:46:46] (PS2) BBlack: upload storage: transition cp1050+cp1062 [puppet] - https://gerrit.wikimedia.org/r/311997
[21:47:49] (CR) BBlack: [C: 2] upload storage: transition cp1050+cp1062 [puppet] - https://gerrit.wikimedia.org/r/311997 (owner: BBlack)
[21:49:26] PROBLEM - puppet last run on mw1254 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:52:01] Operations, Ops-Access-Requests, Patch-For-Review: Request for access to stat1003 for Sam Walton - https://phabricator.wikimedia.org/T145788#2657662 (Dzahn) @Ocassi Thanks for the approval, alright. @SamWalton9 Your user has now been created on stat1003 and the bastion hosts and you have been added...
[21:52:07] Operations, Ops-Access-Requests, Patch-For-Review: Request for access to stat1003 for Sam Walton - https://phabricator.wikimedia.org/T145788#2657663 (Dzahn) Open>Resolved a:Dzahn
[21:52:49] Operations, Ops-Access-Requests: Access to people.wikimedia.org for Volker_E - https://phabricator.wikimedia.org/T143465#2657668 (Dzahn) a:Dzahn
[21:58:46] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0]
[21:58:57] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[22:00:31] cache_upload issues, resolving it already
[22:06:17] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[22:09:19] (CR) Bmansurov: "The change is being SWAT deployed: https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=855407&oldid=853021" [mediawiki-config] - https://gerrit.wikimedia.org/r/311197 (https://phabricator.wikimedia.org/T144912) (owner: Bmansurov)
[22:13:50] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[22:13:56] RECOVERY - puppet last run on mw1254 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[22:15:46] greg-g: May I grab this hour for a small CentralNotice tweak?
[22:16:11] I'd also like to push a change to get a new wfDebugLog bucket
[22:17:38] thcipriani: ^ heads-up that I'm gonna try this deployment now.
[22:18:31] awight: sounds good :)
[22:23:57] Operations: setup wmf4747/wmf4748/wmf4749/wmf4750 for temp kubernetes testing - https://phabricator.wikimedia.org/T146171#2657693 (RobH)
[22:23:59] Operations, ops-eqiad: get port info for wmf4747/wmf4748/wmf4749/wmf4750 - https://phabricator.wikimedia.org/T146172#2657692 (RobH) Open>Resolved
[22:25:36] (PS1) Awight: Capture the "CentralNotice" log bucket [mediawiki-config] - https://gerrit.wikimedia.org/r/312124 (https://phabricator.wikimedia.org/T144952)
[22:25:52] Reedy: If you have a minute, does that look reasonable? ^
[22:26:42] yup,
[22:26:44] should be fine
[22:27:58] thanks for pointing me to the right variable!
[22:31:07] !log awight@tin Synchronized php-1.28.0-wmf.18/extensions/CentralNotice: Correct CentralNotice logging for T144952 (duration: 00m 51s)
[22:31:10] T144952: Banner not showing up on site - https://phabricator.wikimedia.org/T144952
[22:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:32:48] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[22:32:50] (CR) Awight: [C: 2] Capture the "CentralNotice" log bucket [mediawiki-config] - https://gerrit.wikimedia.org/r/312124 (https://phabricator.wikimedia.org/T144952) (owner: Awight)
[22:33:05] !log awight@tin Synchronized php-1.28.0-wmf.20/extensions/CentralNotice: Correct CentralNotice logging for T144952 (duration: 00m 51s)
[22:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:33:17] (Merged) jenkins-bot: Capture the "CentralNotice" log bucket [mediawiki-config] - https://gerrit.wikimedia.org/r/312124 (https://phabricator.wikimedia.org/T144952) (owner: Awight)
[22:33:18] Operations: setup wmf4747/wmf4748/wmf4749/wmf4750 for temp kubernetes testing - https://phabricator.wikimedia.org/T146171#2657758 (RobH)
[22:35:08] !log awight@tin Synchronized wmf-config/InitialiseSettings.php: Add CentralNotice debug log bucket for T144952 (duration: 00m 48s)
[22:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:37:05] (PS1) Andrew Bogott: Make the nova_ldap notification handler thread-safe [puppet] - https://gerrit.wikimedia.org/r/312127
[22:41:32] PROBLEM - puppet last run on elastic1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:42:51] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Fundraising Sprint Rocket Surgery 2016, Patch-For-Review: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2657832 (awight) From now on, when this condition happens we should see a logline in `fluori...
[22:42:52] odd shit for my install... why does it have no free leases when its coming over the private vlan for a private ip address assignment...
[22:43:56] (CR) Alex Monk: [C: 1] Make the nova_ldap notification handler thread-safe [puppet] - https://gerrit.wikimedia.org/r/312127 (owner: Andrew Bogott)
[22:44:26] (CR) Andrew Bogott: [C: 2] Make the nova_ldap notification handler thread-safe [puppet] - https://gerrit.wikimedia.org/r/312127 (owner: Andrew Bogott)
[22:44:57] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[22:45:09] greg-g: thcipriani: I'm done deploying, CentralNotice & the wikis seems to be stable still.
[22:47:10] nice :)
[22:48:51] Today was a good day, e.g. I didn't even have to use my AK!
[22:49:00] (PS3) Faidon Liambotis: mail.wikimedia.org cert expires on Thursday 2016-09-22 [puppet] - https://gerrit.wikimedia.org/r/311641 (https://phabricator.wikimedia.org/T144568) (owner: RobH)
[22:49:17] (CR) Faidon Liambotis: [C: 2] mail.wikimedia.org cert expires on Thursday 2016-09-22 [puppet] - https://gerrit.wikimedia.org/r/311641 (https://phabricator.wikimedia.org/T144568) (owner: RobH)
[22:49:52] (CR) Faidon Liambotis: [V: 2] mail.wikimedia.org cert expires on Thursday 2016-09-22 [puppet] - https://gerrit.wikimedia.org/r/311641 (https://phabricator.wikimedia.org/T144568) (owner: RobH)
[22:50:27] awight: /me rolls up one leg of my jeans, and not for riding my bike
[22:50:59] O_O
[22:52:17] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[22:53:41] Operations, Mail, Patch-For-Review: mx1001/2001 - Exim SMTP - Certificate expires Sep 22 2016 - https://phabricator.wikimedia.org/T144568#2657852 (faidon) Open>Resolved New, corrected certificates have been issued by @RobH and replaced on the two servers.
[22:55:58] that was tight
[22:57:32] !log change-prop deploying ea8cdf8
[22:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:00:05] RoanKattouw, ostriches, MaxSem, and Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160921T2300).
[23:00:05] bmansurov and Jdlrobson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:23] here
[23:00:38] PROBLEM - puppet last run on restbase1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:01:51] Hello.
[23:01:55] I can SWAT this evening.
[23:02:02] hello
[23:02:41] (here)
[23:04:04] Dereckson: since we're still officially in a held state re the train, I'd like to follow this: https://wikitech.wikimedia.org/wiki/Deployments/Holding_the_train#What_happens_in_SWAT_while_the_train_is_on_hold.3F
[23:04:08] for tonight
[23:04:11] * Dereckson nods.
[23:04:28] ty :)
[23:04:44] (we're more in a "catchup" mode right now, but still, to play it safe)
[23:05:30] RECOVERY - puppet last run on restbase1012 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[23:05:49] (ah, I was checking version in use and wondered about wmf.20)
[23:06:06] (the catchup explains that)
[23:06:08] i guess that's my change out..
[23:06:09] * greg-g nods
[23:06:34] jdlrobson: so you should reschudle your change
[23:06:57] jdlrobson: we only take emergency fixes or simple config change this evening
[23:06:57] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[23:07:23] Dereckson: my patch is still on for deployment right?
[23:07:31] looking
[23:07:35] It's a config change that doesn't affect code in production
[23:07:45] ok
[23:08:23] RECOVERY - puppet last run on elastic1022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:09:00] Bsadowski1: yes, it seems in the scope of "simple config change"
[23:10:04] (PS3) Dereckson: Blacklist minerva from showing Related Articles in the footer [mediawiki-config] - https://gerrit.wikimedia.org/r/311197 (https://phabricator.wikimedia.org/T144912) (owner: Bmansurov)
[23:10:51] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/311197 (https://phabricator.wikimedia.org/T144912) (owner: Bmansurov)
[23:11:18] (Merged) jenkins-bot: Blacklist minerva from showing Related Articles in the footer [mediawiki-config] - https://gerrit.wikimedia.org/r/311197 (https://phabricator.wikimedia.org/T144912) (owner: Bmansurov)
[23:11:48] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[23:12:55] bmansrov: live on mw1099
[23:13:02] Dereckson: great, thank you.
[23:16:33] was there more lazyloadimages deployment recently (broader, I mean)
[23:17:20] hmmm nevermind that question, wrong graph parameters :)
[23:18:35] (PS1) RobH: Revert "setting ip addresses for temp kubernetes hosts" [dns] - https://gerrit.wikimedia.org/r/312137
[23:18:54] bmansrov: looks good to you? (I imagine there is not a lot to test, as code using this setting hasn't been merged yet)
[23:19:06] (PS2) RobH: Revert "setting ip addresses for temp kubernetes hosts" [dns] - https://gerrit.wikimedia.org/r/312137
[23:19:36] Dereckson: yes, as long as the code is merged, I'm good. We'll test it once the code that utilizes this config change is in production.
[23:19:39] Dereckson: thanks again
[23:20:18] bmansrov: yes but do a minimal check on mw1099 nothing is broken please
[23:22:25] Dereckson: that's test.wikipedia.org with the X-Wikimedia-Debug header right?
[23:22:36] (CR) RobH: [C: 2] "i put in the wrong vlans for these, using a, b, and c, rather than b, c and d." [dns] - https://gerrit.wikimedia.org/r/312137 (owner: RobH)
[23:22:52] bmansrov: there's FF/Chrome extensions for this
[23:23:06] https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug
[23:23:19] Reedy: ok thanks
[23:24:56] Dereckson: I did a quick test and RelatedArticles are loading fine (the config change was part of the related articles extension).
[23:25:19] https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug can be converted into a Microsoft Edge extensions
[23:26:10] using https://www.microsoft.com/en-gb/store/p/microsoft-edge-extension-toolkit/9nblggh4txvb?tduid=(9370d57dd0981b2411755262306d51d6)(213688)(2795219)()()
[23:29:52] (PS1) RobH: setting kubernetes test host ip addresses [dns] - https://gerrit.wikimedia.org/r/312138
[23:30:42] Operations: setup wmf4747/wmf4748/wmf4749/wmf4750 for temp kubernetes testing - https://phabricator.wikimedia.org/T146171#2657903 (RobH)
[23:31:02] (CR) RobH: [C: 2] setting kubernetes test host ip addresses [dns] - https://gerrit.wikimedia.org/r/312138 (owner: RobH)
[23:31:35] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Blacklist minerva from showing Related Articles in the footer (T144912, currently no-op) (duration: 00m 49s)
[23:31:36] T144912: MEDIAWIKI_URL may be set to incorrect value in mwext-mw-selenium job - https://phabricator.wikimedia.org/T144912
[23:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:32:38] !log dereckson@tin Synchronized wmf-config/CommonSettings.php: Blacklist minerva from showing Related Articles in the footer (T144912, currently no-op) (duration: 00m 47s)
[23:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:33:54] Reedy i just converted it to microsoft edge
[23:33:58] with just one change
[23:34:29] bmansrov: here you are.
[23:34:36] Thanks for checking.
[23:35:27] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:36:22] 14 Undefined variable: wmgRelatedArticlesFooterBlacklistedSkins in /srv/mediawiki/wmf-config/CommonSettings.php on line 2798
[23:37:12] Even syncing in the right order, we can get some issues.
[23:37:18] Reedy https://github.com/paladox/EdgeWikimediaDebug :)
[23:43:29] PROBLEM - puppet last run on elastic1044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues