[00:01:32] hu? I was enabling this on the wikisource family, not on sourceswiki it seems, writing a fix
[00:01:33] manually checked, code definitely made it
[00:01:40] k
[00:01:55] PROBLEM - IPMI Temperature on labtestvirt2003 is CRITICAL: Return code of 255 is out of bounds
[00:05:29] (PS1) Dereckson: Fix correct wiki name for wgPageLanguageUseDB on www.wikisource.org [mediawiki-config] - https://gerrit.wikimedia.org/r/377674 (https://phabricator.wikimedia.org/T175622)
[00:05:31] thcipriani: ^
[00:06:11] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/377674 (https://phabricator.wikimedia.org/T175622) (owner: Dereckson)
[00:07:54] (Merged) jenkins-bot: Fix correct wiki name for wgPageLanguageUseDB on www.wikisource.org [mediawiki-config] - https://gerrit.wikimedia.org/r/377674 (https://phabricator.wikimedia.org/T175622) (owner: Dereckson)
[00:08:08] (CR) jenkins-bot: Fix correct wiki name for wgPageLanguageUseDB on www.wikisource.org [mediawiki-config] - https://gerrit.wikimedia.org/r/377674 (https://phabricator.wikimedia.org/T175622) (owner: Dereckson)
[00:08:27] Dereckson: ^ on mwdebug1002, check please
[00:08:33] This time it works :)
[00:08:47] cool, going live :)
[00:11:25] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:377357|Enable Special:PageLanguage on mul.wikisource]] [[gerrit:377674|Fix correct wiki name for wgPageLanguageUseDB on www.wikisource.org]] T175622 (duration: 00m 49s)
[00:11:33] ^ Dereckson live now
[00:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:39] T175622: Enable Special:PageLanguage on mul.wikisource - https://phabricator.wikimedia.org/T175622
[00:13:16] Thanks for the deployment.
[00:13:41] yw :)
[00:13:46] PROBLEM - NTP on labtestvirt2003 is CRITICAL: NTP CRITICAL: No response from NTP server
[00:19:51] (PS1) Faidon Liambotis: salt: change roots/runner_dir from Array to String [puppet] - https://gerrit.wikimedia.org/r/377676
[00:25:27] (PS2) Faidon Liambotis: salt: change roots/runner_dir from Array to String [puppet] - https://gerrit.wikimedia.org/r/377676
[00:26:06] (PS1) Dzahn: add wikimania2019 [dns] - https://gerrit.wikimedia.org/r/377680
[00:26:20] (CR) jerkins-bot: [V: -1] add wikimania2019 [dns] - https://gerrit.wikimedia.org/r/377680 (owner: Dzahn)
[00:27:56] (CR) Dzahn: "" Failed to open GeoIP2 database '/usr/share/GeoIP/GeoIP2-City.mmdb':" ?" [dns] - https://gerrit.wikimedia.org/r/377680 (owner: Dzahn)
[00:28:33] (PS2) Dzahn: add wikimania2019 [dns] - https://gerrit.wikimedia.org/r/377680
[00:28:46] (CR) jerkins-bot: [V: -1] add wikimania2019 [dns] - https://gerrit.wikimedia.org/r/377680 (owner: Dzahn)
[00:30:15] Operations, Patch-For-Review: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850#3602830 (Dzahn)
[00:31:18] Operations, Patch-For-Review: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850#3177068 (Dzahn)
[00:42:29] paravoid: do i need to take any precautions before rebooting a dnsrecursor, (and ntpd) in this case acamar, the "service restarts" wiki page seems to say it's just fine and doesn't even mention depooling. but i could depool it with confctl
[00:42:41] (PS8) GeoffreyT2000: Rename Wikisaurus namespace on Wiktionary to "Thesaurus" [mediawiki-config] - https://gerrit.wikimedia.org/r/374063 (https://phabricator.wikimedia.org/T174264)
[00:43:06] mutante: there were some issues around recursor availability and Node.js services, IIRC -- bblack will know more
[00:43:15] paravoid: ok, thanks!
[00:50:59] (CR) Reedy: "o_0" [dns] - https://gerrit.wikimedia.org/r/377680 (owner: Dzahn)
[00:52:28] (PS2) Faidon Liambotis: openstack2: use !~ instead of ! $title =~ /.../ [puppet] - https://gerrit.wikimedia.org/r/377512
[00:53:35] (CR) Dzahn: "it's 2019, not 2018 though?!" [dns] - https://gerrit.wikimedia.org/r/377680 (owner: Dzahn)
[00:53:45] (CR) Faidon Liambotis: [C: 2] openstack2: use !~ instead of ! $title =~ /.../ [puppet] - https://gerrit.wikimedia.org/r/377512 (owner: Faidon Liambotis)
[00:54:32] (PS2) Faidon Liambotis: racktables: pass $racktables_host to module [puppet] - https://gerrit.wikimedia.org/r/377669
[00:55:16] (CR) Faidon Liambotis: [C: 2] racktables: pass $racktables_host to module [puppet] - https://gerrit.wikimedia.org/r/377669 (owner: Faidon Liambotis)
[00:55:23] (CR) Dzahn: [C: 1] racktables: pass $racktables_host to module [puppet] - https://gerrit.wikimedia.org/r/377669 (owner: Faidon Liambotis)
[00:55:29] (CR) Faidon Liambotis: [C: 2] "PCC: http://puppet-compiler.wmflabs.org/7835/krypton.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/377669 (owner: Faidon Liambotis)
[00:59:35] (PS3) Faidon Liambotis: salt: change roots/runner_dir from Array to String [puppet] - https://gerrit.wikimedia.org/r/377676
[00:59:37] (PS2) Faidon Liambotis: thumbor: fix weird integer interpolation [puppet] - https://gerrit.wikimedia.org/r/377513
[01:01:28] (PS3) Dzahn: aptly: support components for clients [puppet] - https://gerrit.wikimedia.org/r/374813 (owner: Hashar)
[01:03:45] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 0
[01:07:29] (CR) Faidon Liambotis: "PCC looks good, both change and future:" [puppet] - https://gerrit.wikimedia.org/r/377513 (owner: Faidon Liambotis)
[01:08:13] (CR) Dzahn: [C: 1] salt: change roots/runner_dir from Array to String [puppet] - https://gerrit.wikimedia.org/r/377676 (owner: Faidon Liambotis)
[01:09:00] (CR) Faidon Liambotis: [C: 2] "PCC looks good, both change and future:" [puppet] - https://gerrit.wikimedia.org/r/377676 (owner: Faidon Liambotis)
[01:09:20] (PS5) Jcrespo: Add new m1 host db2078, enable firewall on all misc services [puppet] - https://gerrit.wikimedia.org/r/377460 (https://phabricator.wikimedia.org/T175685)
[01:10:47] (CR) Dzahn: [C: 1] aptly: support https [puppet] - https://gerrit.wikimedia.org/r/374837 (owner: Hashar)
[01:11:05] (PS3) Faidon Liambotis: role::snapshot::common: properly scope included classes [puppet] - https://gerrit.wikimedia.org/r/377493 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[01:17:01] (CR) Faidon Liambotis: "No-op according to the PCC:" [puppet] - https://gerrit.wikimedia.org/r/377493 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[01:19:21] (PS1) Dzahn: grafana: quote reserved word 'type' [puppet] - https://gerrit.wikimedia.org/r/377685
[01:19:38] !log stopping mariadb @ db1048 to clone it to db1059
[01:19:43] (CR) jerkins-bot: [V: -1] grafana: quote reserved word 'type' [puppet] - https://gerrit.wikimedia.org/r/377685 (owner: Dzahn)
[01:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:11] mutante: Ifbb9a3b91222e00dca8fabe7208b12585ce64564 ?
[01:20:52] mutante: also yours seem to be a merge conflict
[01:20:54] paravoid: i see :) i'm too late :)
[01:21:04] yea, messed up rebase
[01:21:42] (Abandoned) Dzahn: grafana: quote reserved word 'type' [puppet] - https://gerrit.wikimedia.org/r/377685 (owner: Dzahn)
[01:22:22] i used the list from https://puppet-compiler.wmflabs.org/compiler02/7683/index-future.html
[01:22:33] and randomly picked krypton
[01:22:34] it's already outdated :)
[01:22:39] I'll update the task in a bit
[01:22:42] yep :) and cool
[01:24:05] PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[01:24:07] PROBLEM - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[01:24:46] (PS1) Faidon Liambotis: snapshot: do a proper scope lookup for apachedir [puppet] - https://gerrit.wikimedia.org/r/377686
[01:26:05] ACKNOWLEDGEMENT - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Jcrespo cloning db1048
[01:26:06] ACKNOWLEDGEMENT - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Jcrespo cloning db1048
[01:26:42] going for dinner then, will see later what is left
[01:26:49] mutante: nothing really
[01:26:58] everything is done I think
[01:28:58] (CR) Faidon Liambotis: [C: 2] "Before:" [puppet] - https://gerrit.wikimedia.org/r/377686 (owner: Faidon Liambotis)
[01:30:46] (PS32) Faidon Liambotis: cassandra: future parser and Puppet 4 compatibility [puppet] - https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704) (owner: Gehel)
[01:39:36] (CR) Faidon Liambotis: [C: -1] "Fails with the PCC:" [puppet] - https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704) (owner: Gehel)
[01:41:20] Operations, Puppet, Patch-For-Review, User-Joe: Switch all hosts to the future parser - https://phabricator.wikimedia.org/T171704#3602978 (faidon) I pushed and merged a bunch of changes under Gerrit's [[ https://gerrit.wikimedia.org/r/#/q/topic:future-parser | topic:future-parser ]] today. I also...
[01:57:45] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:58:54] (PS1) Jcrespo: mariadb: Repoint m3 secondary host (replica) to db1059 [puppet] - https://gerrit.wikimedia.org/r/377687 (https://phabricator.wikimedia.org/T175679)
[02:00:06] (PS3) Faidon Liambotis: thumbor: fix weird integer interpolation [puppet] - https://gerrit.wikimedia.org/r/377513
[02:00:08] (PS1) Faidon Liambotis: hhvm: use '', not undef for light_process_file_prefix [puppet] - https://gerrit.wikimedia.org/r/377688
[02:00:10] (PS1) Faidon Liambotis: scap: do scope lookups in mw-deployment-vars.erb [puppet] - https://gerrit.wikimedia.org/r/377689
[02:06:09] (CR) Faidon Liambotis: "This and its sibling are noops with the current parser:" [puppet] - https://gerrit.wikimedia.org/r/377688 (owner: Faidon Liambotis)
[02:06:17] (CR) Faidon Liambotis: "This and its sibling are noops with the current parser:" [puppet] - https://gerrit.wikimedia.org/r/377689 (owner: Faidon Liambotis)
[02:11:07] (CR) Jcrespo: [C: 2] mariadb: Repoint m3 secondary host (replica) to db1059 [puppet] - https://gerrit.wikimedia.org/r/377687 (https://phabricator.wikimedia.org/T175679) (owner: Jcrespo)
[02:15:56] (CR) Chad: [C: 2] Hygiene: Remove dead config variable [mediawiki-config] - https://gerrit.wikimedia.org/r/372838 (owner: Jdlrobson)
[02:17:05] RECOVERY - haproxy failover on dbproxy1008 is OK: OK check_failover servers up 2 down 0
[02:17:25] (Merged) jenkins-bot: Hygiene: Remove dead config variable [mediawiki-config] - https://gerrit.wikimedia.org/r/372838 (owner: Jdlrobson)
[02:17:35] (CR) jenkins-bot: Hygiene: Remove dead config variable [mediawiki-config] - https://gerrit.wikimedia.org/r/372838 (owner: Jdlrobson)
[02:18:05] RECOVERY - haproxy failover on dbproxy1003 is OK: OK check_failover servers up 2 down 0
[02:21:31] paravoid: wow! hah, ok :)
[02:24:56] (CR) Dzahn: [C: 2] aptly: support components for clients [puppet] - https://gerrit.wikimedia.org/r/374813 (owner: Hashar)
[02:25:14] (PS4) Dzahn: aptly: support components for clients [puppet] - https://gerrit.wikimedia.org/r/374813 (owner: Hashar)
[02:27:33] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.17) (duration: 08m 31s)
[02:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:28:25] (PS5) Dzahn: aptly: support https [puppet] - https://gerrit.wikimedia.org/r/374837 (owner: Hashar)
[02:29:26] (CR) Dzahn: [C: 2] "it's not really switching to https, it's just stops hardcoding http" [puppet] - https://gerrit.wikimedia.org/r/374837 (owner: Hashar)
[02:29:32] (PS1) Jcrespo: phabricator/mariadb: Update database configuration for stretch/10.1 [puppet] - https://gerrit.wikimedia.org/r/377693 (https://phabricator.wikimedia.org/T175679)
[02:31:27] (CR) Dzahn: "what was wrong after the first revert?" [puppet] - https://gerrit.wikimedia.org/r/377269 (https://phabricator.wikimedia.org/T172333) (owner: Alexandros Kosiaris)
[02:36:23] (PS2) Jcrespo: phabricator/mariadb: Update database configuration for stretch/10.1 [puppet] - https://gerrit.wikimedia.org/r/377693 (https://phabricator.wikimedia.org/T175679)
[02:38:04] Operations, Patch-For-Review: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850#3603007 (Dzahn) Thanks @gehel! So only acamar (DNS recursor) and baham (ns1.wm.org auth DNS) are left. @bblack are there special precautions to be taken before rebooting these? For the recurso...
[02:40:07] Operations, monitoring, Patch-For-Review: Fix Icinga checks for test/decom servers - https://phabricator.wikimedia.org/T151632#3603012 (Dzahn) @akosiaris Sorry for the late reply, yes that definitely sounds reasonable. Let's go ahead as you suggested, incl. the part about evaluating AUTO-ACKS later i...
[02:41:13] !log demon@tin Synchronized wmf-config/mobile.php: rm unused config var (duration: 00m 46s)
[02:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:42:37] (CR) Dzahn: "i am close to abandoning this in favor of Alex approach https://gerrit.wikimedia.org/r/#/c/373291/ convinced by ticket comments. but mayb" [puppet] - https://gerrit.wikimedia.org/r/368124 (https://phabricator.wikimedia.org/T151632) (owner: Dzahn)
[02:43:06] (CR) Jcrespo: [C: -1] "It should not be deployed until we failover the m3 master to db1059." [puppet] - https://gerrit.wikimedia.org/r/377693 (https://phabricator.wikimedia.org/T175679) (owner: Jcrespo)
[02:45:40] (CR) Dzahn: "cool with me if this replaces https://gerrit.wikimedia.org/r/#/c/368124/ (or maybe they can even both be merged), but +1 to what volans sa" [puppet] - https://gerrit.wikimedia.org/r/373291 (https://phabricator.wikimedia.org/T151632) (owner: Alexandros Kosiaris)
[02:48:15] (PS1) BryanDavis: tools: add an exim sender blocklist [puppet] - https://gerrit.wikimedia.org/r/377697
[02:48:47] (PS6) Dzahn: bastionhost: convert to role/profile structure [puppet] - https://gerrit.wikimedia.org/r/353599
[02:58:58] (PS6) Jcrespo: mariadb/phabricator: update GRANTS from iridium to phab1001 [puppet] - https://gerrit.wikimedia.org/r/369832 (https://phabricator.wikimedia.org/T163938) (owner: Dzahn)
[03:01:06] (PS1) Chad: group1 to wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/377700
[03:01:15] (CR) Chad: [C: -2] "For tomorrow" [mediawiki-config] - https://gerrit.wikimedia.org/r/377700 (owner: Chad)
[03:03:50] !log dropping old users from phabricator database
[03:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:05:15] PROBLEM - Check Varnish expiry mailbox lag on cp1062 is CRITICAL: CRITICAL: expiry mailbox lag is 2006248
[03:05:16] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.18) (duration: 16m 28s)
[03:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:08:24] (PS1) Jcrespo: misc dbs: Repoint m3-slave to the new replica server db1059 [dns] - https://gerrit.wikimedia.org/r/377701 (https://phabricator.wikimedia.org/T175679)
[03:09:33] (PS7) Jcrespo: mariadb/phabricator: update GRANTS from iridium to phab1001 [puppet] - https://gerrit.wikimedia.org/r/369832 (https://phabricator.wikimedia.org/T163938) (owner: Dzahn)
[03:10:18] (CR) Jcrespo: [C: 2] mariadb/phabricator: update GRANTS from iridium to phab1001 [puppet] - https://gerrit.wikimedia.org/r/369832 (https://phabricator.wikimedia.org/T163938) (owner: Dzahn)
[03:11:05] (CR) Jcrespo: [C: 2] misc dbs: Repoint m3-slave to the new replica server db1059 [dns] - https://gerrit.wikimedia.org/r/377701 (https://phabricator.wikimedia.org/T175679) (owner: Jcrespo)
[03:12:46] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Sep 13 03:12:46 UTC 2017 (duration 7m 30s)
[03:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:21:25] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:21:25] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:21:25] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[03:21:25] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:21:25] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:21:25] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[03:21:35] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[03:21:36] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:21:36] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:24:25] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[03:24:25] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[03:24:25] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[03:24:26] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[03:24:26] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[03:24:26] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave
[03:24:45] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[03:24:45] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[03:24:45] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[03:31:29] Operations, monitoring, Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3603218 (Dzahn) I ran some more cumin (out of 1331 hosts, 91 had one or more screens running, i had closed a handful i had started myself). The overall number of screens is ar...
[03:38:25] (CR) Dzahn: [C: -1] "why does this fail with "Error: Could not find class ::prometheus::ops " but only on 2 hosts and not all ?!" [puppet] - https://gerrit.wikimedia.org/r/353599 (owner: Dzahn)
[03:44:48] (PS7) Dzahn: bastionhost: convert to role/profile structure [puppet] - https://gerrit.wikimedia.org/r/353599
[03:47:25] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[03:49:35] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[03:59:57] (CR) Dzahn: [C: -1] "still something wrong with 3002 and 4001 but it's different now.. will follow-up" [puppet] - https://gerrit.wikimedia.org/r/353599 (owner: Dzahn)
[04:06:32] (PS1) Jcrespo: mariadb - phabricator: Remove public hashes from configuration files [puppet] - https://gerrit.wikimedia.org/r/377703 (https://phabricator.wikimedia.org/T163938)
[04:06:45] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[04:08:09] (CR) Jcrespo: [C: 2] mariadb - phabricator: Remove public hashes from configuration files [puppet] - https://gerrit.wikimedia.org/r/377703 (https://phabricator.wikimedia.org/T163938) (owner: Jcrespo)
[04:08:45] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[04:11:25] (PS1) Jcrespo: phabricator-mariadb: Add missing include [puppet] - https://gerrit.wikimedia.org/r/377704
[04:11:55] (CR) Jcrespo: [C: 2] phabricator-mariadb: Add missing include [puppet] - https://gerrit.wikimedia.org/r/377704 (owner: Jcrespo)
[04:19:21] (CR) KartikMistry: "recheck" [debs/contenttranslation/apertium-crh-tur] - https://gerrit.wikimedia.org/r/377449 (https://phabricator.wikimedia.org/T174765) (owner: KartikMistry)
[04:19:27] (CR) jerkins-bot: [V: -1] apertium-crh-tur: Initial Debian packaging [debs/contenttranslation/apertium-crh-tur] - https://gerrit.wikimedia.org/r/377449 (https://phabricator.wikimedia.org/T174765) (owner: KartikMistry)
[04:27:38] Operations, DBA, Phabricator: Decom db1048 (BBU Faulty - slave lagging) - https://phabricator.wikimedia.org/T160731#3603287 (jcrespo)
[04:28:41] (PS1) Jcrespo: decommission: Set db1048 with a spare role [puppet] - https://gerrit.wikimedia.org/r/377705 (https://phabricator.wikimedia.org/T175679)
[04:29:43] (PS2) Jcrespo: decommission: Set db1048 with a spare role [puppet] - https://gerrit.wikimedia.org/r/377705 (https://phabricator.wikimedia.org/T175679)
[04:30:21] (CR) Jcrespo: [C: 2] decommission: Set db1048 with a spare role [puppet] - https://gerrit.wikimedia.org/r/377705 (https://phabricator.wikimedia.org/T175679) (owner: Jcrespo)
[04:37:23] Operations, ops-eqiad, DBA, Phabricator: Decommission db1048 (was Move m3 slave to db1059) - https://phabricator.wikimedia.org/T175679#3603295 (jcrespo)
[05:15:49] !log kartik@tin Started deploy [cxserver/deploy@ee0081d]: Update cxserver to 3baaa36
[05:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:45] !log kartik@tin Finished deploy [cxserver/deploy@ee0081d]: Update cxserver to 3baaa36 (duration: 02m 56s)
[05:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:35:22] RECOVERY - Check Varnish expiry mailbox lag on cp1062 is OK: OK: expiry mailbox lag is 28141
[05:46:33] PROBLEM - Check size of conntrack table on labnet1001 is CRITICAL: CRITICAL: nf_conntrack is 93 % full
[05:59:52] RECOVERY - Check size of conntrack table on labnet1001 is OK: OK: nf_conntrack is 79 % full
[06:38:12] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 49 probes of 287 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[06:48:12] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 11 probes of 287 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[07:08:58] !log installing tcpdump security updates (fixing 90 CVE IDs...)
[07:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:31] https://lists.debian.org/debian-security-announce/2017/msg00233.html wow :o
[07:28:04] (CR) Alexandros Kosiaris: [C: 2] Fix apertium-all-dev Depends [debs/contenttranslation/apertium] - https://gerrit.wikimedia.org/r/377389 (owner: KartikMistry)
[07:30:05] Operations, Patch-For-Review: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850#3603399 (MoritzMuehlenhoff) The last time we rebooted DNS recursors eventbus bailed, but that specific bug has been fixed since then. The high level ticket is T171498. So quick depool should be...
[07:30:08] (CR) Alexandros Kosiaris: "mediawiki was undeployable. That happened because while this patch was live, the gerrit deploy key was added, alphabetically bumping the m" [puppet] - https://gerrit.wikimedia.org/r/377269 (https://phabricator.wikimedia.org/T172333) (owner: Alexandros Kosiaris)
[07:38:03] PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tcpdump]
[07:44:14] (PS3) Alexandros Kosiaris: Ship the default egress policy [puppet] - https://gerrit.wikimedia.org/r/377470 (https://phabricator.wikimedia.org/T170111)
[07:47:16] !log rolling out hhvm-luasandbox 2.0.14 to the remaining hosts in eqiad (along with HHVM restarts) (T173705)
[07:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:29] T173705: HHVM: Unknown exception - https://phabricator.wikimedia.org/T173705
[07:54:58] (CR) Alexandros Kosiaris: [C: 1] "So, this class is included in 120 hosts. The groups are" [puppet] - https://gerrit.wikimedia.org/r/376500 (https://phabricator.wikimedia.org/T175242) (owner: Ladsgroup)
[07:55:52] PROBLEM - HHVM jobrunner on mw1167 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[07:56:03] !log bounce varnish on cp1062, mailbox lag
[07:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:42] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[07:56:52] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[07:56:53] RECOVERY - HHVM jobrunner on mw1167 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.004 second response time
[07:57:02] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
[07:57:32] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[07:59:28] Operations, Discovery, Maps, Maps-Sprint, monitoring: Map caches metrics look broken - https://phabricator.wikimedia.org/T141186#3603433 (Gehel)
[08:00:00] Operations, Discovery, Maps, Maps-Sprint, monitoring: Map caches metrics look broken - https://phabricator.wikimedia.org/T141186#2489781 (Gehel) Let's review this in next maps standup and close if we agree.
[08:00:34] godog: do you already have a task to port application metrics to prometheus? I'm trying to add moving elasticsearch metrics to our board...
[08:01:34] (CR) Mobrovac: [C: 1] "I'd like to be around for scb, wtp, restbase* and the ones you don't want to touch :)" [puppet] - https://gerrit.wikimedia.org/r/376500 (https://phabricator.wikimedia.org/T175242) (owner: Ladsgroup)
[08:03:02] PROBLEM - HHVM jobrunner on mw1166 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[08:03:07] gehel: yeah there's T145659 which is a generic one, not a specific one yet for jmx tho
[08:03:08] T145659: Port application-specific metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T145659
[08:03:35] godog: thanks! I'll create a subtask for elasticsearch...
[08:04:02] RECOVERY - HHVM jobrunner on mw1166 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time
[08:04:34] godog: side note, we should probably have some standard metrics we collect in the same way for all JVMs
[08:06:02] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[08:06:02] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:06:32] gehel: indeed! there is a baseline of metrics jmx_exporter collects from the jvm itself, we'll need to find out if that's enough or we'd need more
[08:06:36] godog: that ticket is about moving from ganglia to prometheus, I thought we wanted to port the metrics collected by diamond -> graphite to prometheus?
[08:06:41] (CR) Alexandros Kosiaris: "overall LGTM, inline comments" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/377697 (owner: BryanDavis)
[08:06:43] (CR) Filippo Giunchedi: [C: 1] thumbor: fix weird integer interpolation [puppet] - https://gerrit.wikimedia.org/r/377513 (owner: Faidon Liambotis)
[08:06:48] (CR) Ladsgroup: "I doubt striker, aqs, ocg, and maps would change, they defined logstash1001 (or logstash1003) explicitly in their configs so they need ano" [puppet] - https://gerrit.wikimedia.org/r/376500 (https://phabricator.wikimedia.org/T175242) (owner: Ladsgroup)
[08:07:22] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:07:42] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:09:02] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:09:34] (PS4) Giuseppe Lavagetto: thumbor: fix weird integer interpolation [puppet] - https://gerrit.wikimedia.org/r/377513 (owner: Faidon Liambotis)
[08:09:52] <_joe_> godog: merging this ^^
[08:10:16] _joe_: ok, thanks!
[08:10:38] gehel: indeed, let me open a new one!
[08:10:49] godog: thanks!
[08:11:53] <_joe_> godog: uhm wait a sec
[08:11:58] <_joe_> I'm not sure it's correct
[08:12:51] <_joe_> that third argument shouldn't be there
[08:13:11] Operations, monitoring: Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T175798#3603447 (fgiunchedi)
[08:13:18] gehel: ^
[08:13:35] _joe_: looks like that's the second argument for prefix
[08:13:46] godog: thanks! And also thanks for the fix on labs puppetmaster!
[08:14:00] gehel: hehe that was ottomata actually!
[08:14:17] well, then thanks ottomata :)
[08:14:32] <_joe_> godog: uh?
[08:14:39] <_joe_> oh, right [08:14:50] <_joe_> damn gerrit [08:15:07] <_joe_> I had to download the patch and open it in aterminal [08:16:02] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [08:17:02] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.007 second response time [08:17:15] (03CR) 10Giuseppe Lavagetto: [C: 032] thumbor: fix weird integer interpolation [puppet] - 10https://gerrit.wikimedia.org/r/377513 (owner: 10Faidon Liambotis) [08:17:25] (03PS1) 10Hashar: contint: remove references to Ubuntu/Trusty [puppet] - 10https://gerrit.wikimedia.org/r/377717 (https://phabricator.wikimedia.org/T175696) [08:18:58] (03CR) 10Hashar: "The only parts on contint1001/contint2001 are the comment changes in Apache configuration files." [puppet] - 10https://gerrit.wikimedia.org/r/377717 (https://phabricator.wikimedia.org/T175696) (owner: 10Hashar) [08:23:04] 10Operations, 10monitoring, 10Discovery-Search (Current work): port elasticsearch diamond collector to prometheus - https://phabricator.wikimedia.org/T175799#3603468 (10Gehel) [08:23:22] (03CR) 10Giuseppe Lavagetto: [C: 032] hhvm: use '', not undef for light_process_file_prefix [puppet] - 10https://gerrit.wikimedia.org/r/377688 (owner: 10Faidon Liambotis) [08:23:28] (03PS2) 10Giuseppe Lavagetto: hhvm: use '', not undef for light_process_file_prefix [puppet] - 10https://gerrit.wikimedia.org/r/377688 (owner: 10Faidon Liambotis) [08:24:04] (03CR) 10Alexandros Kosiaris: [C: 031] "ocg does not use that part of the code. aqs and maps indeed explicitly mention logstash1001. 
aqs seems to duplicate parts of the configura" [puppet] - 10https://gerrit.wikimedia.org/r/376500 (https://phabricator.wikimedia.org/T175242) (owner: 10Ladsgroup) [08:24:28] !log installing emacs security updates [08:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:31] (03PS2) 10Giuseppe Lavagetto: scap: do scope lookups in mw-deployment-vars.erb [puppet] - 10https://gerrit.wikimedia.org/r/377689 (owner: 10Faidon Liambotis) [08:33:23] (03CR) 10Giuseppe Lavagetto: [C: 032] scap: do scope lookups in mw-deployment-vars.erb [puppet] - 10https://gerrit.wikimedia.org/r/377689 (owner: 10Faidon Liambotis) [08:39:32] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [08:39:51] <_joe_> second time this morning ^^ [08:40:52] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [08:41:32] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [08:42:39] the x-cache headers seems to show ints for cp1054 [08:43:12] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [08:43:21] <_joe_> yeah [08:44:25] (03CR) 10Volans: [C: 04-1] "The YAML generation should be modified, see more details and other few minor comments inline." 
(035 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/375980 (owner: 10Muehlenhoff) [08:46:24] elukey, _joe_: there is also a report in -tech, maybe you want to comment there too [08:46:53] the weird thing is that the two mailbox expiry lag alerts are for upload cache hosts [08:47:12] <_joe_> let's take a harder look [08:47:15] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp1054&var-datasource=eqiad%20prometheus%2Fops&from=now-3h&to=now&panelId=21&fullscreen [08:47:22] cp1054 peaked for a bit [08:47:38] that aligns with the 503s afaics [08:47:55] <_joe_> yes [08:48:38] <_joe_> that's the current one, not sure about the preceding peak [08:50:22] the other one that ended up at ~4 AM UTC seems to have x-cache ints for cp1072 [08:50:33] (upload) [08:51:33] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:51:54] indeed, the earlier spike this morning looks like it was also from text https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text [08:52:22] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:52:42] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:53:10] I'd restart varnish backend on cp1072 now, that is the critical in icinga.. 
[08:53:29] it is upload but it might fire again in that state [08:53:47] <_joe_> elukey: yes, use the dedicated script [08:53:59] elukey: go for it, doesn't look like the recurring upload mailbox problem [08:54:38] (03PS3) 10Muehlenhoff: Readd rollback handling to debdeploy [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/375980 [08:55:02] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:55:20] !log restart varnish backend on cp1072 (upload) - mailbox expiry lag [08:55:23] (03CR) 10Giuseppe Lavagetto: "The patch seems to fail on the hosts that have cassandra 3 and are in the main cluster." [puppet] - 10https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704) (owner: 10Gehel) [08:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:03] (03CR) 10DCausse: Setup Cirrus MLR models for top 20 language AB test (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377393 (owner: 10EBernhardson) [08:58:02] RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 0 [08:58:07] done, varnishlog looks good [08:59:01] oh nice, I've missed all pages since monday morning. 
[08:59:09] * apergos glares once again at their android phone [09:02:03] looks like both cp1065 and cp1054 created 10x their usual threads at the time they started failing and a ton of shortlived objects [09:02:08] (03PS1) 10Alexandros Kosiaris: package_builder: Update README.md [puppet] - 10https://gerrit.wikimedia.org/r/377718 [09:05:46] !log upload apertium_3.4.2~r68466-2+wmf2 to apt.wikimedia.org/jessie-wikimedia [09:05:48] kart_: ^ [09:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:56] (03CR) 10Alexandros Kosiaris: [C: 032] package_builder: Update README.md [puppet] - 10https://gerrit.wikimedia.org/r/377718 (owner: 10Alexandros Kosiaris) [09:09:33] (03Abandoned) 10Samtar: Make both LoginNotify email features default for Hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374082 (https://phabricator.wikimedia.org/T174263) (owner: 10Samtar) [09:12:24] (03CR) 10Alexandros Kosiaris: [C: 04-1] apertium-crh: Initial Debian packaging (031 comment) [debs/contenttranslation/apertium-crh] - 10https://gerrit.wikimedia.org/r/377390 (https://phabricator.wikimedia.org/T174765) (owner: 10KartikMistry) [09:13:02] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-tur: Initial Debian packaging [debs/contenttranslation/apertium-tur] - 10https://gerrit.wikimedia.org/r/377392 (https://phabricator.wikimedia.org/T174765) (owner: 10KartikMistry) [09:13:23] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-cat: New upstream release [debs/contenttranslation/apertium-cat] - 10https://gerrit.wikimedia.org/r/377395 (https://phabricator.wikimedia.org/T174988) (owner: 10KartikMistry) [09:13:36] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-ita: New upstream release [debs/contenttranslation/apertium-ita] - 10https://gerrit.wikimedia.org/r/377398 (https://phabricator.wikimedia.org/T174988) (owner: 10KartikMistry) [09:17:22] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold 
[1000.0] [09:17:38] * elukey sigh [09:18:02] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [09:18:04] seems 1054 again [09:18:38] not that big this time [09:18:40] already recovered [09:18:51] 10Operations, 10Traffic: Text eqiad varnish 503 spikes - https://phabricator.wikimedia.org/T175803#3603574 (10fgiunchedi) [09:18:54] heh, I just opened ^ [09:20:14] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-srd: New upstream release [debs/contenttranslation/apertium-srd] - 10https://gerrit.wikimedia.org/r/377396 (https://phabricator.wikimedia.org/T174988) (owner: 10KartikMistry) [09:20:52] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [09:21:12] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [09:23:19] (03PS1) 10Urbanecm: Make hiwiki logo a little bit bigger [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377719 (https://phabricator.wikimedia.org/T175721) [09:23:31] !log upload apertium-cat_2.3.0~r82237-1+wmf1 to apt.wikimedia.org/jessie-wikimedia/main [09:23:32] !log upload apertium-ita_0.10.0~r82237-1+wmf1 to apt.wikimedia.org/jessie-wikimedia/main [09:23:32] !log upload apertium-tur_0.1.0~r81882-1+wmf1 to apt.wikimedia.org/jessie-wikimedia/main [09:23:34] T174988 [09:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:11] T174988: Update apertium-cat, apertium-srd, apertium-ita and apertium-srd-ita packages - https://phabricator.wikimedia.org/T174988 [09:24:29] Morning [09:24:35] Planned upgrades this week? 
[09:24:43] It was bit slow this morning [09:24:52] (Wikipedia that is) [09:25:04] But I'd been having general performance issues with internet anyway [09:25:49] !log upload apertium-srd_0.10.0~r82237-1+wmf1_amd64.changes to apt.wikimedia.org/jessie-wikimedia/main T174988 [09:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:09] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-crh-tur] - 10https://gerrit.wikimedia.org/r/377449 (https://phabricator.wikimedia.org/T174765) (owner: 10KartikMistry) [09:26:12] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:26:15] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-cat-srd] - 10https://gerrit.wikimedia.org/r/377402 (https://phabricator.wikimedia.org/T174987) (owner: 10KartikMistry) [09:26:17] (03CR) 10jerkins-bot: [V: 04-1] apertium-crh-tur: Initial Debian packaging [debs/contenttranslation/apertium-crh-tur] - 10https://gerrit.wikimedia.org/r/377449 (https://phabricator.wikimedia.org/T174765) (owner: 10KartikMistry) [09:26:20] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-srd-ita] - 10https://gerrit.wikimedia.org/r/377399 (https://phabricator.wikimedia.org/T174988) (owner: 10KartikMistry) [09:26:22] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:26:33] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:26:53] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:27:14] ShakespeareFan00: not something out of the ordinary. 
my guess it's related to your internet issues [09:30:54] (03PS2) 10Alexandros Kosiaris: contint: remove references to Ubuntu/Trusty [puppet] - 10https://gerrit.wikimedia.org/r/377717 (https://phabricator.wikimedia.org/T175696) (owner: 10Hashar) [09:30:59] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] contint: remove references to Ubuntu/Trusty [puppet] - 10https://gerrit.wikimedia.org/r/377717 (https://phabricator.wikimedia.org/T175696) (owner: 10Hashar) [09:31:14] (03PS6) 10Jcrespo: Add new m1 host db2078, enable firewall on all misc services [puppet] - 10https://gerrit.wikimedia.org/r/377460 (https://phabricator.wikimedia.org/T175685) [09:32:13] akosiaris: \o/ [09:33:13] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-cat-srd: Initial Debian packaging [debs/contenttranslation/apertium-cat-srd] - 10https://gerrit.wikimedia.org/r/377402 (https://phabricator.wikimedia.org/T174987) (owner: 10KartikMistry) [09:33:32] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-srd-ita: New upstream release [debs/contenttranslation/apertium-srd-ita] - 10https://gerrit.wikimedia.org/r/377399 (https://phabricator.wikimedia.org/T174988) (owner: 10KartikMistry) [09:34:06] :) [09:34:33] (03PS1) 10Muehlenhoff: Add emacs-nox to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/377721 [09:34:54] (03PS1) 10Jcrespo: mariadb: Fix bug by which lag shown is 1000 larger than real [puppet] - 10https://gerrit.wikimedia.org/r/377722 [09:35:39] (03PS2) 10Jcrespo: mariadb: Fix bug by which lag shown is 1000 larger than real [puppet] - 10https://gerrit.wikimedia.org/r/377722 [09:35:43] (03Abandoned) 10Hashar: (DO NOT SUBMIT) contint: pin firefox to 46 on Trusty [puppet] - 10https://gerrit.wikimedia.org/r/293739 (https://phabricator.wikimedia.org/T137561) (owner: 10JanZerebecki) [09:36:29] !log installing bind updates from Debian stable/oldstable SUA update (client tools only) [09:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:27] (03CR) 10Jcrespo: 
[C: 032] mariadb: Fix bug by which lag shown is 1000 larger than real [puppet] - 10https://gerrit.wikimedia.org/r/377722 (owner: 10Jcrespo) [09:43:16] (03CR) 10Filippo Giunchedi: [C: 031] Add emacs-nox to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/377721 (owner: 10Muehlenhoff) [09:43:23] ACKNOWLEDGEMENT - MD RAID on lvs3001 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T175808 [09:43:27] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T175808#3603681 (10ops-monitoring-bot) [09:43:36] (03CR) 10Jayprakash12345: [C: 031] Make hiwiki logo a little bit bigger [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377719 (https://phabricator.wikimedia.org/T175721) (owner: 10Urbanecm) [09:44:41] WUT? degraded but not failed? that's new [09:44:53] PROBLEM - salt-minion processes on lvs3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [09:45:03] PROBLEM - Check systemd state on lvs3001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[09:45:06] (03PS3) 10Filippo Giunchedi: ferm: add return traffic for ferm::client notrack [puppet] - 10https://gerrit.wikimedia.org/r/374169 (https://phabricator.wikimedia.org/T173731) [09:46:25] (03CR) 10Filippo Giunchedi: [C: 032] ferm: add return traffic for ferm::client notrack [puppet] - 10https://gerrit.wikimedia.org/r/374169 (https://phabricator.wikimedia.org/T173731) (owner: 10Filippo Giunchedi) [09:48:02] mmmh seems that the OOM killer killed *a lot* of things there [09:48:07] ema: ^^^ [09:48:55] godog, _joe_ too ^^^^ (just remembered em.a is out IIRC) [09:51:16] oom_killer killed 85 processes :( [09:52:29] (03CR) 10Alexandros Kosiaris: [C: 031] "lol, ok" [puppet] - 10https://gerrit.wikimedia.org/r/377721 (owner: 10Muehlenhoff) [09:52:53] (03PS33) 10Gehel: cassandra: future parser and Puppet 4 compatibility [puppet] - 10https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704) [09:54:14] (03PS1) 10Volans: depool esams [dns] - 10https://gerrit.wikimedia.org/r/377728 [09:57:46] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T175808#3603730 (10Volans) [09:57:49] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T168619#3603732 (10Volans) [09:59:13] (03CR) 10Elukey: [C: 031] Add emacs-nox to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/377721 (owner: 10Muehlenhoff) [10:04:16] (03CR) 10MarcoAurelio: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373333 (owner: 10Brian Wolff) [10:04:29] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "As an emacs user, I'm not sure if we want production servers to have both vim and emacs - too many annoying differences. At the very least" [puppet] - 10https://gerrit.wikimedia.org/r/377721 (owner: 10Muehlenhoff) [10:06:03] (03CR) 10MarcoAurelio: [C: 031] "If you could rebase it, I can take care of scheduling this for the today's EU SWAT." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/373333 (owner: 10Brian Wolff) [10:06:05] !log upload apertium-srd-ita_0.9.5~r82237-1+wmf1 to apt.wikimedia.org/jessie-wikimedia/main [10:06:05] !log upload apertium-cat-srd_0.9.0~r82238-1+wmf1 to apt.wikimedia.org/jessie-wikimedia/main [10:06:12] <_joe_> moritzm: I'm sure I was the last one you expected a -1 from :P [10:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:20] (03CR) 10Volans: [C: 04-1] "I agree with Giuseppe, what's the specific need for emacs in the first place?" [puppet] - 10https://gerrit.wikimedia.org/r/377721 (owner: 10Muehlenhoff) [10:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:57] <_joe_> volans: the need is that me and moritzm could keep using our favourite editor :P [10:07:08] (03CR) 10MarcoAurelio: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367465 (owner: 10Reedy) [10:07:22] also, can we purge nano? :D [10:07:23] <_joe_> so I agree in principle, but I'd post a default emacs config for root too [10:07:26] (03CR) 10Gehel: "Correction done (I missed one of the hiera file). I re ran the same puppet-compiler job:" [puppet] - 10https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704) (owner: 10Gehel) [10:07:44] <_joe_> gehel: hah, I was doing the same now :P [10:07:48] <_joe_> (re-running pcc) [10:08:07] _joe_: :) [10:08:10] <_joe_> gehel: so the difference on aqs persists [10:08:15] * _joe_ looking into it [10:08:44] _joe_: I understand why those classes are here with the change, not sure why they were not there before... [10:09:07] <_joe_> gehel: because aqs doesn't do TLS to cassandra? [10:09:13] <_joe_> elukey: ^^ am I right? 
[10:09:29] <_joe_> gehel: to be clear: they're there with the future parser, not with the old one [10:09:36] <_joe_> so the change per-se is a noop [10:09:43] yep [10:09:49] <_joe_> it's just a long-lingering defect with the future parser [10:10:00] _joe_ correct [10:10:01] <_joe_> I bet it has to do with '' evaluating to "true" [10:10:30] <_joe_> https://wikitech.wikimedia.org/wiki/User:Giuseppe_Lavagetto/PuppetFutureParser#The_empty_string_evaluates_to_true_in_boolean_context [10:10:33] * gehel is looking again, maybe staring hard enough will solve the issue [10:10:45] <_joe_> gehel: I'm looking into it [10:10:49] (03CR) 10MarcoAurelio: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374446 (https://phabricator.wikimedia.org/T174413) (owner: 10Ebe123) [10:10:52] thanks! [10:11:00] _joe_: it's hardly only the two of us :-) the ones on restbase are from Eric and I also got additional endorsement on IRC. but that makes sense, I'll add a stub config for root [10:11:27] <_joe_> moritzm: you know I'll commit all of prelude in my home dir on the cluster, right? [10:11:30] <_joe_> :P [10:12:16] <_joe_> like vim-lovers have been doing forever, making me hate the idea we're distributing files in homes via puppet [10:12:40] I'm all for it, reading a foreign ~/.emacs is like reading a good book! [10:13:07] <_joe_> oh I gave up on having my own .emacs; I use emacs-prelude with some personalizations [10:13:17] <_joe_> it's better than what I came up with anyways [10:13:34] (03CR) 10MarcoAurelio: [C: 031] "Change looks good, consensus is achieved too. Maybe this patch needs rebase though." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377310 (https://phabricator.wikimedia.org/T154371) (owner: 10Framawiki) [10:15:44] (03CR) 10jerkins-bot: [V: 04-1] Update mediawiki-codesniffer to 0.11.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367465 (owner: 10Reedy) [10:17:46] <_joe_> gehel: found it! 
profile/manifests/cassandra.pp line 45 [10:18:23] _joe_: ofc! Thanks! [10:18:39] <_joe_> gehel: so let me fix that [10:19:06] (03CR) 10Addshore: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/150042 (owner: 10Hashar) [10:19:33] _joe_: please do! I'm going out for lunch... [10:19:41] <_joe_> eheh ok [10:20:06] (03CR) 10Addshore: "check" [puppet] - 10https://gerrit.wikimedia.org/r/150042 (owner: 10Hashar) [10:20:28] (03PS34) 10Giuseppe Lavagetto: cassandra: future parser and Puppet 4 compatibility [puppet] - 10https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704) (owner: 10Gehel) [10:21:19] (03PS35) 10Giuseppe Lavagetto: cassandra: future parser and Puppet 4 compatibility [puppet] - 10https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704) (owner: 10Gehel) [10:21:42] PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [10:21:52] <_joe_> moritzm: is that you ^^ ? [10:22:42] RECOVERY - HHVM jobrunner on mw1259 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.031 second response time [10:28:49] yeah, that's part of the rolling reboot to pick up the hhvm-luasandbox bugfix update, mw1259 is one of the video scalers [10:28:56] rolling restart ofc [10:33:04] (03PS2) 10MarcoAurelio: Follow-up 6d62e9ea8a. 
Also allow crats to remove accountcreator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373333 (owner: 10Brian Wolff) [10:37:01] (03CR) 10Alexandros Kosiaris: WIP: Allow silencing notifications for hosts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/373291 (https://phabricator.wikimedia.org/T151632) (owner: 10Alexandros Kosiaris) [10:41:55] (03PS1) 10Filippo Giunchedi: thumbor: send errors to logstash in beta too [puppet] - 10https://gerrit.wikimedia.org/r/377731 (https://phabricator.wikimedia.org/T150734) [10:42:24] (03CR) 10Giuseppe Lavagetto: [C: 031] "https://puppet-compiler.wmflabs.org/compiler02/7849/index-future.html my PS fixed the issue with aqs." [puppet] - 10https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704) (owner: 10Gehel) [10:43:17] (03PS2) 10Filippo Giunchedi: thumbor: send errors to logstash in beta too [puppet] - 10https://gerrit.wikimedia.org/r/377731 (https://phabricator.wikimedia.org/T150734) [10:44:28] (03CR) 10Filippo Giunchedi: [C: 032] thumbor: send errors to logstash in beta too [puppet] - 10https://gerrit.wikimedia.org/r/377731 (https://phabricator.wikimedia.org/T150734) (owner: 10Filippo Giunchedi) [10:50:05] (03PS5) 10Alexandros Kosiaris: Allow silencing notifications for hosts [puppet] - 10https://gerrit.wikimedia.org/r/373291 (https://phabricator.wikimedia.org/T151632) [10:52:19] (03CR) 10Alexandros Kosiaris: [C: 032] "I 've removed the handler stuff and addressed comments. I 'll merge as is and let's pick it from there." 
[puppet] - 10https://gerrit.wikimedia.org/r/373291 (https://phabricator.wikimedia.org/T151632) (owner: 10Alexandros Kosiaris) [10:52:20] !log reboot lvs3001 [10:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:48] !log joal@tin Started deploy [analytics/refinery@b2e8852]: Regular analytics weekly deploy (new jars, oozie patches) [10:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:53] (03PS1) 10ArielGlenn: monitor wrapper should cd into its base dir on every run of python script [dumps] - 10https://gerrit.wikimedia.org/r/377735 (https://phabricator.wikimedia.org/T175817) [11:01:45] (03CR) 10ArielGlenn: [C: 032] monitor wrapper should cd into its base dir on every run of python script [dumps] - 10https://gerrit.wikimedia.org/r/377735 (https://phabricator.wikimedia.org/T175817) (owner: 10ArielGlenn) [11:02:01] !log joal@tin Finished deploy [analytics/refinery@b2e8852]: Regular analytics weekly deploy (new jars, oozie patches) (duration: 03m 12s) [11:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:26] !log ariel@tin Started deploy [dumps/dumps@3f119ab]: monitor will cd to dumps repo before each run [11:03:28] !log ariel@tin Finished deploy [dumps/dumps@3f119ab]: monitor will cd to dumps repo before each run (duration: 00m 02s) [11:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:22] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 609.79 seconds [11:10:16] (03CR) 10Reedy: "Not gonna pass due to symlink fail" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367465 (owner: 10Reedy) [11:11:33] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1953 bytes in 0.163 second response time 
[11:14:34] 10Operations: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3603955 (10akosiaris) 05stalled>03Open Getting back to this [11:16:33] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1933 bytes in 0.133 second response time [11:19:43] ACKNOWLEDGEMENT - MD RAID on lvs3001 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T175820 [11:19:49] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T175820#3603968 (10ops-monitoring-bot) [11:19:58] akosiaris: I've tried to document your change https://wikitech.wikimedia.org/w/index.php?title=Icinga&type=revision&diff=1769985&oldid=1763907 [11:20:02] RECOVERY - salt-minion processes on lvs3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:20:06] but it probably can be done better [11:20:52] a link to an example commit would be missing, I think [11:21:31] jynus: nice [11:21:37] I 'll try to add some examples [11:22:31] I am trying to help because I am very happy about that going forward, but I may have failed to understand all the nuances [11:22:54] (e.g. I am assuming how it should be used and how it should be not) Please change as you wish [11:23:17] akosiaris: thanks because this is really really helpful for new db installs [11:23:20] RECOVERY - Check systemd state on lvs3001 is OK: OK - running: The system is fully operational [11:23:40] I can set a hiera value and not stress about sending false positives [11:24:09] ok seems to have worked for cp1008 [11:24:40] how does it show on icinga web when set? is it gone? [11:24:54] what do you mean gone ? 
[11:25:07] look at cp1008 for an idea of how it looks [11:25:08] like, what is the visible effect on the web interface [11:25:14] a muted icon [11:25:17] nothing more [11:25:22] ah, like usual [11:25:24] cool [11:25:49] ok, stupid question [11:26:00] yeah, after some discussions with mutante and volans we 've decided to not implement the auto-ack part [11:26:09] I see that and I think "oh, this is wrong, this should be enabled" [11:26:26] and I enable manually [11:26:34] what would happen? [11:27:19] I like the non-autoack, I think this is much better [11:27:37] you should check that the "Modified Attributes" field in the service/host is not None (that means that it was modified manually compared to the configuration) [11:28:00] in this case it is None because it is part of the config, but I agree, could be overlooked [11:28:05] volans: oh, I am assuming that what I should and what I will do by accident [11:28:12] is not the same :-) [11:28:19] yeah ! :D [11:29:17] so I am researching because the inevitable "I will not do the right thing, how this can break?" [11:29:41] 10Operations, 10Release-Engineering-Team, 10TCB-Team, 10WMDE-QWERTY-Team-Board: Deploy new Wikidiff2 version on beta-cluster - https://phabricator.wikimedia.org/T175818#3604022 (10Tobi_WMDE_SW) @Legoktm @greg, adding the teams #operations and #release-engineering-team as I'm not sure who exactly would be i... 
[11:29:44] that doesn't change that I am 100% in favor of this, as I said [11:32:59] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T175820#3604028 (10Volans) [11:33:02] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T168619#3604030 (10Volans) [11:34:12] 10Operations, 10Release-Engineering-Team, 10TCB-Team, 10User-Addshore, 10WMDE-QWERTY-Team-Board: Deploy new Wikidiff2 version on beta-cluster - https://phabricator.wikimedia.org/T175818#3604031 (10Addshore) [11:39:58] !log disabling semi-sync replication on the largest s5 database servers [11:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:36] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1954 bytes in 0.114 second response time [11:45:41] 10Operations, 10monitoring, 10Patch-For-Review: Fix Icinga checks for test/decom servers - https://phabricator.wikimedia.org/T151632#3604052 (10akosiaris) 05Open>03Resolved This seems to work fine. cp1008 and all services are marked as muted in icinga web and cp1046 (spare::system role) the same. Jaime w... 
[11:48:59] (03PS5) 10ArielGlenn: restructure dumps webserver, zim manifests to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/376750 (https://phabricator.wikimedia.org/T175592) [11:49:04] (03PS2) 10Jcrespo: mariadb: Set db1097 as the main api server for s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377489 [11:49:06] (03PS1) 10Jcrespo: mariadb: Remove regular load from db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377740 [11:49:29] (03CR) 10jerkins-bot: [V: 04-1] restructure dumps webserver, zim manifests to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/376750 (https://phabricator.wikimedia.org/T175592) (owner: 10ArielGlenn) [11:49:45] (03CR) 10Jcrespo: [C: 032] mariadb: Set db1097 as the main api server for s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377489 (owner: 10Jcrespo) [11:51:15] (03CR) 10Jcrespo: [C: 032] mariadb: Remove regular load from db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377740 (owner: 10Jcrespo) [11:51:19] (03CR) 10jenkins-bot: mariadb: Set db1097 as the main api server for s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377489 (owner: 10Jcrespo) [11:51:21] (03PS6) 10ArielGlenn: restructure dumps webserver, zim manifests to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/376750 (https://phabricator.wikimedia.org/T175592) [11:52:20] 10Operations, 10Release-Engineering-Team, 10TCB-Team, 10User-Addshore, 10WMDE-QWERTY-Team-Board: Deploy new Wikidiff2 version on beta-cluster - https://phabricator.wikimedia.org/T175818#3603917 (10MoritzMuehlenhoff) We need to update the package on apt.wikimedia.org, then it's available on the beta clust... 
[11:52:54] (03Merged) 10jenkins-bot: mariadb: Remove regular load from db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377740 (owner: 10Jcrespo) [11:53:00] 10Operations, 10Release-Engineering-Team, 10TCB-Team, 10User-Addshore, 10WMDE-QWERTY-Team-Board: Deploy new Wikidiff2 version on beta-cluster - https://phabricator.wikimedia.org/T175818#3604072 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [11:54:07] (03CR) 10jenkins-bot: mariadb: Remove regular load from db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377740 (owner: 10Jcrespo) [11:54:12] moritzm: re https://phabricator.wikimedia.org/T175818 do the patches have to be cherrypicked to the debian branch? [11:57:12] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Set db1097 as the main api server for s4, Remove regular load from db1070 (duration: 00m 56s) [11:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:32] addshore: not needed, I can take care of that when building the package, I only need to know which patch(es) to pull [11:57:48] when that has been confirmed to be working fine in beta, I'd really prefer a 1.4.2 release, though [11:58:11] it makes things much clearer for any external users [11:58:13] okay! [11:58:17] I believe there are 2 commits! [11:58:27] 410ab2ff636eed296206b80a3c89aa75a50b0f8a & a1d711ebb5ce6bde66b2e4b1e650318a166895d6 [11:58:42] (03PS7) 10ArielGlenn: restructure dumps webserver, zim manifests to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/376750 (https://phabricator.wikimedia.org/T175592) [11:58:46] both commits were started in 2016, hence the confusing commit dates :) [11:59:11] ok, can you add that to the Phab task, please? I'm off tomorrow and have a long TODO list for today, but I'll look into this on Friday [11:59:40] yup! 
[12:00:01] 10Operations, 10Release-Engineering-Team, 10TCB-Team, 10User-Addshore, 10WMDE-QWERTY-Team-Board: Deploy new Wikidiff2 version on beta-cluster - https://phabricator.wikimedia.org/T175818#3604077 (10Addshore) I believe you need both 410ab2ff636eed296206b80a3c89aa75a50b0f8a and a1d711ebb5ce6bde66b2e4b1e6503... [12:03:36] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 2093000 [12:07:06] !log cp1049 - backend restart, mailbox lag [12:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:39] !log stopping wikidata rebuild maintenance script on terbium [12:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:51] (03PS2) 10BBlack: ciphersuites: add TLSv1.3 variants in "high" list [puppet] - 10https://gerrit.wikimedia.org/r/365014 (https://phabricator.wikimedia.org/T170567) [12:13:41] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 0 [12:13:42] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1921 bytes in 0.112 second response time [12:14:27] !log rebooting mc1015 (spare) for some tests [12:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:42] (03PS2) 10KartikMistry: apertium-crh: Initial Debian packaging [debs/contenttranslation/apertium-crh] - 10https://gerrit.wikimedia.org/r/377390 (https://phabricator.wikimedia.org/T174765) [12:15:31] akosiaris: thanks! 
[12:18:18] (03CR) 10BBlack: [C: 032] ciphersuites: add TLSv1.3 variants in "high" list [puppet] - 10https://gerrit.wikimedia.org/r/365014 (https://phabricator.wikimedia.org/T170567) (owner: 10BBlack) [12:20:44] 10Operations, 10Traffic: Extending our HSTS value beyond ~1y - https://phabricator.wikimedia.org/T170598#3604105 (10BBlack) 05Open>03Resolved a:03BBlack no substantive counter-arguments in 2 months, resolving [12:22:38] (03PS3) 10BBlack: HSTS: higher and custom max-age [puppet] - 10https://gerrit.wikimedia.org/r/361864 [12:27:21] PROBLEM - cassandra-a CQL 10.64.0.167:9042 on restbase-dev1004 is CRITICAL: connect to address 10.64.0.167 and port 9042: Connection refused [12:28:21] PROBLEM - cassandra-a SSL 10.64.0.167:7001 on restbase-dev1004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [12:28:22] PROBLEM - cassandra-a service on restbase-dev1004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [12:28:41] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:32:51] PROBLEM - puppet last run on mw2189 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:33:11] PROBLEM - puppet last run on db2043 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:36:38] (03PS3) 10BBlack: Revert "Revert "ssl_ciphersuite: limit ECDH curves where possible"" [puppet] - 10https://gerrit.wikimedia.org/r/365078 [12:38:19] (03PS3) 10Gehel: logstash - configure new logstash100[7-9] nodes [puppet] - 10https://gerrit.wikimedia.org/r/376488 (https://phabricator.wikimedia.org/T175045) [12:44:28] (03PS1) 10Muehlenhoff: Blacklist bluetooth kernel module [puppet] - 10https://gerrit.wikimedia.org/r/377745 [12:45:34] (03PS1) 10Hashar: contint: install jsduck via gems [puppet] - 10https://gerrit.wikimedia.org/r/377746 (https://phabricator.wikimedia.org/T175764) [12:47:33] (03PS4) 10Gehel: logstash - configure new logstash100[7-9] nodes [puppet] - 10https://gerrit.wikimedia.org/r/376488 (https://phabricator.wikimedia.org/T175045) [12:47:50] (03CR) 10BBlack: [C: 032] "Validated and/or fixed min nginx package versions in prod + deployment-prep" [puppet] - 10https://gerrit.wikimedia.org/r/365078 (owner: 10BBlack) [12:48:51] (03CR) 10jerkins-bot: [V: 04-1] contint: install jsduck via gems [puppet] - 10https://gerrit.wikimedia.org/r/377746 (https://phabricator.wikimedia.org/T175764) (owner: 10Hashar) [12:49:04] (03PS4) 10BBlack: HSTS: higher and custom max-age [puppet] - 10https://gerrit.wikimedia.org/r/361864 [12:50:41] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/377746 (https://phabricator.wikimedia.org/T175764) (owner: 10Hashar) [12:51:37] (03CR) 10Hashar: "Cherry picked on CI puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/377746 (https://phabricator.wikimedia.org/T175764) (owner: 10Hashar) [12:52:29] (03CR) 10BBlack: [C: 032] HSTS: higher and custom max-age [puppet] - 10https://gerrit.wikimedia.org/r/361864 (owner: 10BBlack) [12:53:41] RECOVERY - cassandra-a service on restbase-dev1004 is OK: OK - cassandra-a is active [12:54:01] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational [12:55:41] RECOVERY 
- cassandra-a SSL 10.64.0.167:7001 on restbase-dev1004 is OK: SSL OK - Certificate restbase-dev1004-a valid until 2018-07-20 15:08:04 +0000 (expires in 310 days) [12:55:51] RECOVERY - cassandra-a CQL 10.64.0.167:9042 on restbase-dev1004 is OK: TCP OK - 0.000 second response time on 10.64.0.167 port 9042 [12:56:02] (03CR) 10Filippo Giunchedi: [C: 031] Blacklist bluetooth kernel module [puppet] - 10https://gerrit.wikimedia.org/r/377745 (owner: 10Muehlenhoff) [13:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170913T1300). Please do the needful. [13:00:05] tabbycat: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:05] (03PS5) 10Gehel: logstash - configure new logstash100[7-9] nodes [puppet] - 10https://gerrit.wikimedia.org/r/376488 (https://phabricator.wikimedia.org/T175045) [13:00:18] o/// [13:00:41] none can be tested, you can deploy right away [13:01:21] the jobqueue is exploding (and it's mostly commons refreshlinks jobs again) [13:01:31] RECOVERY - puppet last run on mw2189 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [13:01:35] (03PS4) 10Muehlenhoff: Readd rollback handling to debdeploy [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/375980 [13:01:39] (03CR) 10Muehlenhoff: Readd rollback handling to debdeploy (035 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/375980 (owner: 10Muehlenhoff) [13:01:41] RECOVERY - puppet last run on db2043 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [13:01:44] o/ [13:02:42] looks like hasharAway is, well, away :) [13:02:45] I can SWAT today! 
[13:03:40] tabbycat: ok, will review and deploy [13:05:08] CI looks under control [13:06:15] thanks, I'm also trying to repair the sunblind of my room so I don't have both hands in the keyboard [13:07:26] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377481 (https://phabricator.wikimedia.org/T175700) (owner: 10MarcoAurelio) [13:09:02] (03Merged) 10jenkins-bot: Temporary lift account creation limits for WM United Kingdom workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377481 (https://phabricator.wikimedia.org/T175700) (owner: 10MarcoAurelio) [13:09:11] (03CR) 10jenkins-bot: Temporary lift account creation limits for WM United Kingdom workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377481 (https://phabricator.wikimedia.org/T175700) (owner: 10MarcoAurelio) [13:09:35] (03PS8) 10ArielGlenn: restructure dumps webserver, zim manifests to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/376750 (https://phabricator.wikimedia.org/T175592) [13:13:24] !log zfilipin@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:377481|Temporary lift account creation limits for WM United Kingdom workshop (T175700)]] (duration: 00m 49s) [13:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:37] T175700: Lift account registration on en.wikipedia for 15th September 2017 - https://phabricator.wikimedia.org/T175700 [13:15:08] 10Operations, 10HHVM: HHVM: Unknown exception - https://phabricator.wikimedia.org/T173705#3604254 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff 2.0.14 with a patch for this has been rolled out across the fleet, will check in a few days whether there's any further occurences in the logs. [13:16:45] zeljkof: there's a mediawiki patch as well there [13:16:47] tabbycat: I can not find any reason for 373333 [13:17:04] there is no phab ticket, no link to community vote... 
[13:17:08] zeljkof: bawolff added the patch to allow bureaucrats [13:17:12] even the parent patch does not have anything [13:17:18] (in fact he is the owner) [13:17:26] but forgot to set it to remove as well [13:18:12] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3463588 (10MoritzMuehlenhoff) JFTR, the patch by @an... [13:18:21] tabbycat: I feel very uncomfortable deploying something like that without any reference to any discussion about it [13:18:51] zeljkof: as far as I remember, it was all 'cooked' in this channel that day [13:18:54] the parent patch does have several +1 votes [13:19:10] (03PS1) 10Elukey: [WIP] role::kafka::jumbo::broker: enable Prometheus JMX monitoring [puppet] - 10https://gerrit.wikimedia.org/r/377753 (https://phabricator.wikimedia.org/T167992) [13:19:11] is it urgent? [13:19:14] this patch is to fix the oversight from the parent one [13:19:17] no, not urgent [13:19:21] just a leftover [13:19:39] if you want to ask Bawolff on the patch, feel free [13:19:45] I would really prefer for it to have anything associated with it (phab ticket, page on a wiki...) 
[13:19:47] I though it'd not be controversial [13:19:58] I /think/ there was something somewhere [13:19:58] I don't think it is [13:20:13] but there is usually some trace somewhere about changes like that [13:20:31] please leave a note at the patch so Bawolff and people can comment there and answer your concerns [13:20:33] I don't have enough experience to be a judge on that :( [13:21:10] * tabbycat goes back repair the sunblind [13:21:11] tabbycat: will do [13:21:21] ok, that finishes EU SWAT then [13:21:27] !log EU SWAT finished [13:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:37] (03PS6) 10Gehel: logstash - configure new logstash100[7-9] nodes [puppet] - 10https://gerrit.wikimedia.org/r/376488 (https://phabricator.wikimedia.org/T175045) [13:22:43] (03CR) 10Zfilipin: "This was scheduled for EU SWAT today:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373333 (owner: 10Brian Wolff) [13:23:07] !log initial installation of logstash100[7-9] - T175045 [13:23:11] (03PS2) 10Elukey: [WIP] role::kafka::jumbo::broker: enable Prometheus JMX monitoring [puppet] - 10https://gerrit.wikimedia.org/r/377753 (https://phabricator.wikimedia.org/T167992) [13:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:20] T175045: setup/install logstash100[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T175045 [13:23:30] (03CR) 10Gehel: [C: 032] logstash - configure new logstash100[7-9] nodes [puppet] - 10https://gerrit.wikimedia.org/r/376488 (https://phabricator.wikimedia.org/T175045) (owner: 10Gehel) [13:23:56] (03CR) 10Volans: [C: 04-1] "reply inline" (031 comment) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/375980 (owner: 10Muehlenhoff) [13:24:46] (03PS2) 10BryanDavis: tools: add an exim sender blocklist [puppet] - 10https://gerrit.wikimedia.org/r/377697 [13:24:49] ehm. magnus says on twitter that tools-static.wmflabs.org has disappeared ? [13:25:49] seems so... 
[13:26:33] (03CR) 10Volans: Readd rollback handling to debdeploy (032 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/375980 (owner: 10Muehlenhoff) [13:29:12] !log update puppet compiler's facts via ./modules/puppet_compiler/files/compiler-update-facts [13:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:12] (03CR) 10Alexandros Kosiaris: [C: 031] Blacklist bluetooth kernel module [puppet] - 10https://gerrit.wikimedia.org/r/377745 (owner: 10Muehlenhoff) [13:30:44] !log T169936: Converting RESTBase dev environment to size-tiered compaction [13:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:56] T169936: Services 2017/18 Q1 goal: Start gradual roll-out of Cassandra 3 & new schema to resolve storage scaling issues and OOM errors. - https://phabricator.wikimedia.org/T169936 [13:31:51] (03CR) 10Alexandros Kosiaris: [C: 031] tools: add an exim sender blocklist [puppet] - 10https://gerrit.wikimedia.org/r/377697 (owner: 10BryanDavis) [13:35:09] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-crh: Initial Debian packaging [debs/contenttranslation/apertium-crh] - 10https://gerrit.wikimedia.org/r/377390 (https://phabricator.wikimedia.org/T174765) (owner: 10KartikMistry) [13:35:31] (03PS1) 10Gehel: logstash - ensure plugin directory is present [puppet] - 10https://gerrit.wikimedia.org/r/377755 (https://phabricator.wikimedia.org/T175045) [13:35:48] (03CR) 10Muehlenhoff: Readd rollback handling to debdeploy (031 comment) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/375980 (owner: 10Muehlenhoff) [13:37:05] (03CR) 10DCausse: [C: 031] logstash - ensure plugin directory is present [puppet] - 10https://gerrit.wikimedia.org/r/377755 (https://phabricator.wikimedia.org/T175045) (owner: 10Gehel) [13:37:36] (03CR) 10Gehel: [C: 032] logstash - ensure plugin directory is present [puppet] - 10https://gerrit.wikimedia.org/r/377755 (https://phabricator.wikimedia.org/T175045) (owner: 10Gehel) 
[13:39:36] (03CR) 10Eevans: [C: 031] Add emacs-nox to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/377721 (owner: 10Muehlenhoff) [13:43:57] (03PS1) 10BryanDavis: tools: Move static site to use http2 [puppet] - 10https://gerrit.wikimedia.org/r/377757 (https://phabricator.wikimedia.org/T134383) [13:44:49] (03PS2) 10Muehlenhoff: Blacklist bluetooth kernel module [puppet] - 10https://gerrit.wikimedia.org/r/377745 [13:45:33] !log upload apertium-crh_0.1.0~r81872-1+wmf1 to apt.wikimedia.org/jessie-wikimedia/main [13:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:20] (03CR) 10Muehlenhoff: [C: 032] Blacklist bluetooth kernel module [puppet] - 10https://gerrit.wikimedia.org/r/377745 (owner: 10Muehlenhoff) [13:48:15] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-crh-tur] - 10https://gerrit.wikimedia.org/r/377449 (https://phabricator.wikimedia.org/T174765) (owner: 10KartikMistry) [13:48:20] 10Operations, 10Wikimedia-Apache-configuration: https://test.wikipedia.org/wiki/Bug%3F?action=history doesn't show the history page, unlike https://test.wikipedia.org/w/index.php?title=Bug%3F&action=history - https://phabricator.wikimedia.org/T123276#3604377 (10matmarex) [13:48:21] (03CR) 10jerkins-bot: [V: 04-1] apertium-crh-tur: Initial Debian packaging [debs/contenttranslation/apertium-crh-tur] - 10https://gerrit.wikimedia.org/r/377449 (https://phabricator.wikimedia.org/T174765) (owner: 10KartikMistry) [13:48:53] (03CR) 10Giuseppe Lavagetto: [C: 031] cassandra: future parser and Puppet 4 compatibility (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704) (owner: 10Gehel) [13:50:16] akosiaris: why https://gerrit.wikimedia.org/r/377449 - jenkins refuse to even start the CI job? [13:50:51] kart_: how should I know ? [13:51:06] jenkins can not merge the change... [13:51:21] but whether that's correct or a bug in jenkins ... 
[13:51:37] akosiaris: Let me take a look in the package. [13:52:09] 10Operations, 10Wikimedia-Apache-configuration: URL parameters do not work with pages that have "?" in their names - https://phabricator.wikimedia.org/T123276#3604421 (10IKhitron) [13:52:44] (03PS2) 10KartikMistry: apertium-crh-tur: Initial Debian packaging [debs/contenttranslation/apertium-crh-tur] - 10https://gerrit.wikimedia.org/r/377449 (https://phabricator.wikimedia.org/T174765) [13:52:50] (03CR) 10jerkins-bot: [V: 04-1] apertium-crh-tur: Initial Debian packaging [debs/contenttranslation/apertium-crh-tur] - 10https://gerrit.wikimedia.org/r/377449 (https://phabricator.wikimedia.org/T174765) (owner: 10KartikMistry) [13:53:36] weird. [13:54:36] maybe zuul/jenkins crapped their pants [13:57:31] OK. Let's check when hashar comes online tomorrow. No hurry for this. [14:00:58] (03PS1) 10Muehlenhoff: Extend Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/377762 [14:01:02] (03PS1) 10Gehel: logstash - add new logstash hosts to ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/377763 (https://phabricator.wikimedia.org/T175045) [14:02:34] (03CR) 10DCausse: [C: 031] logstash - add new logstash hosts to ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/377763 (https://phabricator.wikimedia.org/T175045) (owner: 10Gehel) [14:03:31] 10Operations, 10ORES, 10Scoring-platform-team (Current): Give ores admins read access to /srv/log/ores/main.log* - https://phabricator.wikimedia.org/T175736#3604437 (10Joe) I would suggest AGAINST giving access to all logs. We should have a `tail-ores` command that specifically tails the ores logs, like we d... 
[14:03:53] (03CR) 10Muehlenhoff: [C: 032] Extend Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/377762 (owner: 10Muehlenhoff) [14:06:08] (03CR) 10Gehel: [C: 032] logstash - add new logstash hosts to ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/377763 (https://phabricator.wikimedia.org/T175045) (owner: 10Gehel) [14:06:14] (03PS2) 10Gehel: logstash - add new logstash hosts to ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/377763 (https://phabricator.wikimedia.org/T175045) [14:07:53] (03PS2) 10Andrew Bogott: tools: Move static site to use http2 [puppet] - 10https://gerrit.wikimedia.org/r/377757 (https://phabricator.wikimedia.org/T134383) (owner: 10BryanDavis) [14:08:52] (03CR) 10Andrew Bogott: [C: 032] tools: Move static site to use http2 [puppet] - 10https://gerrit.wikimedia.org/r/377757 (https://phabricator.wikimedia.org/T134383) (owner: 10BryanDavis) [14:12:09] (03PS3) 10Rush: tools: add an exim sender blocklist [puppet] - 10https://gerrit.wikimedia.org/r/377697 (owner: 10BryanDavis) [14:14:06] 10Operations, 10ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3604469 (10Papaul) a:05Papaul>03elukey @elukey I think we can put the system back in production to test it out after the Dell Engineer replaced all the parts above. Thanks. [14:14:22] (03CR) 10Ottomata: "Looks good, but I dont' think we should put the prometheus stuff directly in the confluent module. 
Or, if we do, it should be a totally s" [puppet] - 10https://gerrit.wikimedia.org/r/377753 (https://phabricator.wikimedia.org/T167992) (owner: 10Elukey) [14:15:51] (03CR) 10Ottomata: [WIP] role::kafka::jumbo::broker: enable Prometheus JMX monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/377753 (https://phabricator.wikimedia.org/T167992) (owner: 10Elukey) [14:16:03] (03PS1) 10Muehlenhoff: Amend another alias [puppet] - 10https://gerrit.wikimedia.org/r/377769 [14:19:55] (03PS2) 10Muehlenhoff: Amend another alias [puppet] - 10https://gerrit.wikimedia.org/r/377769 [14:20:13] (03CR) 10Ottomata: "Sounds fine to me! Do we want to stop sending them to eventlogging altogether?" [puppet] - 10https://gerrit.wikimedia.org/r/377667 (https://phabricator.wikimedia.org/T171629) (owner: 10Nuria) [14:20:50] (03PS1) 10Gehel: lgostash - activate icinga alerts on new logstash100[7-9] nodes [puppet] - 10https://gerrit.wikimedia.org/r/377771 (https://phabricator.wikimedia.org/T175045) [14:20:51] (03CR) 10Rush: [C: 031] "note atm on server this is /etc/exim4/deny_senders" [puppet] - 10https://gerrit.wikimedia.org/r/377697 (owner: 10BryanDavis) [14:20:56] (03CR) 10Samtar: [C: 031] New 'abusefilter-helper' configuration for en.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377473 (https://phabricator.wikimedia.org/T175684) (owner: 10MarcoAurelio) [14:20:58] (03CR) 10Rush: [C: 032] tools: add an exim sender blocklist [puppet] - 10https://gerrit.wikimedia.org/r/377697 (owner: 10BryanDavis) [14:21:12] (03PS2) 10Gehel: lgostash - activate icinga alerts on new logstash100[7-9] nodes [puppet] - 10https://gerrit.wikimedia.org/r/377771 (https://phabricator.wikimedia.org/T175045) [14:22:03] (03CR) 10Gehel: [C: 032] lgostash - activate icinga alerts on new logstash100[7-9] nodes [puppet] - 10https://gerrit.wikimedia.org/r/377771 (https://phabricator.wikimedia.org/T175045) (owner: 10Gehel) [14:22:50] !log mobrovac@tin Started deploy [cpjobqueue/deploy@60d0a78]: 
Start using the EventBus infrastructure for the updateBetaFeaturesUserCounts job - T175210 [14:23:01] _joe_: fyi ^ [14:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:07] T175210: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210 [14:23:24] !log mobrovac@tin Finished deploy [cpjobqueue/deploy@60d0a78]: Start using the EventBus infrastructure for the updateBetaFeaturesUserCounts job - T175210 (duration: 00m 33s) [14:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:39] (03CR) 10Muehlenhoff: [C: 032] Amend another alias [puppet] - 10https://gerrit.wikimedia.org/r/377769 (owner: 10Muehlenhoff) [14:23:44] <_joe_> mobrovac: oh nice, let me snoop the access logs on the jobrunners [14:23:45] (03PS3) 10Muehlenhoff: Amend another alias [puppet] - 10https://gerrit.wikimedia.org/r/377769 [14:25:33] (03CR) 10Mforns: Add cron to purge old mediawiki data snapshots (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/376640 (https://phabricator.wikimedia.org/T162034) (owner: 10Nuria) [14:27:10] (03PS36) 10Gehel: cassandra: future parser and Puppet 4 compatibility [puppet] - 10https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704) [14:27:14] (03PS1) 10Elukey: site.pp: assign roles to mw1307-28 [puppet] - 10https://gerrit.wikimedia.org/r/377774 (https://phabricator.wikimedia.org/T165519) [14:27:14] PROBLEM - puppet last run on bast2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:27:21] (03CR) 10Gehel: cassandra: future parser and Puppet 4 compatibility (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704) (owner: 10Gehel) [14:28:00] (03PS5) 10Rush: prometheus: allow setting a specific listening address and port [puppet] - 10https://gerrit.wikimedia.org/r/374650 (https://phabricator.wikimedia.org/T169039) [14:28:38] 10Puppet, 10Beta-Cluster-Infrastructure, 10Cloud-Services, 10Release-Engineering-Team (Next), 10User-Joe: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675#3604507 (10chasemp) p:05Triage>03Normal [14:31:10] !log pooling new logstash servers (logstash100[7-9]) - T175045 [14:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:23] T175045: setup/install logstash100[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T175045 [14:31:42] everything should go smoothly with those new servers, but please ping me if you see missing logs or anything strange with logstash / kibana [14:31:59] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=logstash [14:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:45] (03CR) 10Muehlenhoff: Readd rollback handling to debdeploy (032 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/375980 (owner: 10Muehlenhoff) [14:32:55] (03PS5) 10Muehlenhoff: Readd rollback handling to debdeploy [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/375980 [14:35:10] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work): decommission logstash100[1-3] - https://phabricator.wikimedia.org/T175830#3604520 (10Gehel) [14:35:33] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Patch-For-Review: all log producers need to use the logstash LVS endpoint - https://phabricator.wikimedia.org/T175242#3604534 (10Gehel) 
[14:35:35] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work): decommission logstash100[1-3] - https://phabricator.wikimedia.org/T175830#3604533 (10Gehel) [14:36:25] (03CR) 10Volans: [C: 04-1] Readd rollback handling to debdeploy (031 comment) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/375980 (owner: 10Muehlenhoff) [14:37:03] gerrit gerrit, I just replied to a comment on a previous PS... why bothering reporting the C: -1? [14:37:27] it's not actually true anymore given the new PS reset it already, maybe a bug in wikibugs ? [14:50:55] (03PS3) 10Aklapper: Phabricator: Override the frog token's label [puppet] - 10https://gerrit.wikimedia.org/r/371660 (https://phabricator.wikimedia.org/T173208) (owner: 10Greg Grossmeier) [14:51:02] (03PS6) 10EBernhardson: Setup Cirrus MLR models for top 20 language AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377393 (https://phabricator.wikimedia.org/T175771) [14:51:07] (03PS1) 10EBernhardson: Configure enwiki to use CirrusSearch MLR by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377776 (https://phabricator.wikimedia.org/T175772) [14:51:35] (03CR) 10Nuria: Add cron to purge old mediawiki data snapshots (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/376640 (https://phabricator.wikimedia.org/T162034) (owner: 10Nuria) [14:52:21] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Patch-For-Review: setup/install logstash100[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T175045#3604585 (10Gehel) The new logstash servers are up, running and pooled. We still need to switch all log producers to them and deco... 
[14:53:26] (03CR) 10jerkins-bot: [V: 04-1] Configure enwiki to use CirrusSearch MLR by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377776 (https://phabricator.wikimedia.org/T175772) (owner: 10EBernhardson) [14:53:47] 10Operations, 10Epic, 10Goal, 10Services (doing), and 2 others: Services Q1 2017/18 goal: Begin migrating job queue processing to multi-DC enabled eventbus infrastructure. - https://phabricator.wikimedia.org/T169937#3604590 (10mobrovac) [14:53:50] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3604587 (10mobrovac) 05Open>03Resolved The job is being double-produced now, so resolving. [14:54:41] (03CR) 10Nuria: "Yes, we should stop sending these events altogether. I thought we could 1) blacklist 2) scoop and 3) remove code from mediawiki that sends" [puppet] - 10https://gerrit.wikimedia.org/r/377667 (https://phabricator.wikimedia.org/T171629) (owner: 10Nuria) [14:55:23] (03CR) 10DCausse: [C: 031] Setup Cirrus MLR models for top 20 language AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377393 (https://phabricator.wikimedia.org/T175771) (owner: 10EBernhardson) [14:55:41] RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [14:58:24] (03PS2) 10EBernhardson: Configure enwiki to use CirrusSearch MLR by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377776 (https://phabricator.wikimedia.org/T175772) [14:59:05] (03PS1) 10Rush: herron wmcs wide root [labs/private] - 10https://gerrit.wikimedia.org/r/377778 [14:59:39] (03CR) 10Mforns: Add cron to purge old mediawiki data snapshots (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/376640 (https://phabricator.wikimedia.org/T162034) (owner: 10Nuria) [15:01:10] (03CR) 10Rush: [V: 032 C: 032] herron wmcs wide root [labs/private] - 10https://gerrit.wikimedia.org/r/377778 
(owner: 10Rush) [15:01:56] 10Operations, 10Cassandra, 10Epic, 10Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3604628 (10Eevans) [15:01:57] 10Operations, 10Epic, 10Goal, 10Services (doing): Consider a lower virtual node count - https://phabricator.wikimedia.org/T172149#3604626 (10Eevans) 05Open>03declined After some discussion it was determined that it wasn't obvious that deviating from the default would be worth the trade-offs, and that i... [15:09:25] !log demon@tin Synchronized php-1.30.0-wmf.17/extensions/AbuseFilter/: backporting I13051038 (duration: 00m 52s) [15:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:18] (03PS8) 10Andrew Bogott: prometheus::web to apache [puppet] - 10https://gerrit.wikimedia.org/r/377332 (https://phabricator.wikimedia.org/T151009) [15:11:36] godog: ready for me to merge https://gerrit.wikimedia.org/r/#/c/377332/ ? [15:14:12] (03CR) 10DCausse: [C: 031] Configure enwiki to use CirrusSearch MLR by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377776 (https://phabricator.wikimedia.org/T175772) (owner: 10EBernhardson) [15:14:20] (03Abandoned) 10Ottomata: Add class to periodically run dfsadmin -fetchImage [puppet/cdh] - 10https://gerrit.wikimedia.org/r/377350 (owner: 10Ottomata) [15:14:48] andrewbogott: sure, go for it! 
[15:15:26] (03CR) 10Andrew Bogott: [C: 032] prometheus::web to apache [puppet] - 10https://gerrit.wikimedia.org/r/377332 (https://phabricator.wikimedia.org/T151009) (owner: 10Andrew Bogott) [15:18:32] (03PS9) 10ArielGlenn: restructure dumps webserver, zim manifests to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/376750 (https://phabricator.wikimedia.org/T175592) [15:18:56] (03CR) 10jerkins-bot: [V: 04-1] restructure dumps webserver, zim manifests to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/376750 (https://phabricator.wikimedia.org/T175592) (owner: 10ArielGlenn) [15:19:23] PROBLEM - puppet last run on prometheus2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:20:33] PROBLEM - puppet last run on prometheus2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:20:53] !log demon@tin Synchronized php-1.30.0-wmf.17/includes/diff/DifferenceEngine.php: I2ecb6030 (duration: 00m 49s) [15:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:43] !log demon@tin Synchronized php-1.30.0-wmf.17/includes/filerepo/file/LocalFile.php: I2ecb6030 (duration: 00m 49s) [15:21:54] (03PS10) 10ArielGlenn: restructure dumps webserver, zim manifests to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/376750 (https://phabricator.wikimedia.org/T175592) [15:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:07] andrewbogott: didn't seem prometheus200[34] liked the change, expected? 
[15:22:32] !log demon@tin Synchronized php-1.30.0-wmf.18/includes/diff/DifferenceEngine.php: I2ecb6030 (duration: 00m 49s) [15:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:22] !log demon@tin Synchronized php-1.30.0-wmf.18/includes/filerepo/file/LocalFile.php: I2ecb6030 (duration: 00m 49s) [15:23:28] godog: it's because multiple includes of ::web mean duplicates of class { '::nginx': [15:23:28] ensure => absent, [15:23:28] } [15:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:39] godog: I'm thinking about how best to handle this… maybe just if ! defined [15:23:49] except I think we're supposed to not use 'defined' these days? [15:24:18] andrewbogott: ah, heh given it is transitional anyway maybe it is ok [15:24:23] PROBLEM - puppet last run on prometheus1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:25:43] PROBLEM - Check systemd state on bast4001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:26:23] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. 
Failed resources (up to 3 shown): File[/etc/apache2/prometheus.d],Service[apache2] [15:27:56] (03PS1) 10Andrew Bogott: prometheus: avoid duplicate definitions of nginx class [puppet] - 10https://gerrit.wikimedia.org/r/377784 [15:29:08] (03CR) 10Andrew Bogott: [C: 032] prometheus: avoid duplicate definitions of nginx class [puppet] - 10https://gerrit.wikimedia.org/r/377784 (owner: 10Andrew Bogott) [15:31:34] RECOVERY - puppet last run on prometheus2003 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [15:32:54] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3604775 (10elukey) Checked on rack tables the row assignments and this is the slit that I've done: mw1307 - ROW A - videoscaler mw1308-11 (4 hosts) - ROW A - j... [15:33:44] godog: looks like there was an ordering issue but everything should settle down shortly. [15:34:33] RECOVERY - puppet last run on prometheus1004 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [15:36:05] andrewbogott: fantastic [15:36:48] godog: the codfw hosts are settled now, want to double-check that they're still doing what you'd expect? [15:36:53] RECOVERY - puppet last run on prometheus2004 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [15:37:53] RECOVERY - Check systemd state on bast4001 is OK: OK - running: The system is fully operational [15:38:43] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [15:38:59] (03PS11) 10ArielGlenn: restructure dumps webserver, zim manifests to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/376750 (https://phabricator.wikimedia.org/T175592) [15:39:24] andrewbogott: yup, LGTM, I tried some prometheus based dashboard and looks all good to me, thanks! 
[15:39:46] sounds good, thanks for checking [15:42:03] PROBLEM - puppet last run on prometheus1003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/etc/apache2/prometheus.d],Service[apache2] [15:42:23] PROBLEM - Check systemd state on prometheus1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:44:26] (03PS1) 10Volans: WMCS: install Cumin for WMCS admins [puppet] - 10https://gerrit.wikimedia.org/r/377787 (https://phabricator.wikimedia.org/T175712) [15:44:49] (03CR) 10jerkins-bot: [V: 04-1] WMCS: install Cumin for WMCS admins [puppet] - 10https://gerrit.wikimedia.org/r/377787 (https://phabricator.wikimedia.org/T175712) (owner: 10Volans) [15:44:54] RECOVERY - MariaDB Slave Lag: s5 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 281.95 seconds [15:45:03] PROBLEM - Check systemd state on bast3002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:46:14] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/etc/apache2/prometheus.d],Service[apache2] [15:47:31] (03PS2) 10Volans: WMCS: install Cumin for WMCS admins [puppet] - 10https://gerrit.wikimedia.org/r/377787 (https://phabricator.wikimedia.org/T175712) [15:47:44] I'm cleaning up those bastion puppet issues [15:48:01] andrewbogott: ah ok! I was about to ask, I ran puppet on bast3002 [15:48:18] I'm assuming same state on prometheus1003 ? 
[15:49:35] yep, I'll do that too [15:50:13] RECOVERY - Check systemd state on bast3002 is OK: OK - running: The system is fully operational [15:51:23] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [15:52:13] RECOVERY - puppet last run on prometheus1003 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:52:33] RECOVERY - Check systemd state on prometheus1003 is OK: OK - running: The system is fully operational [15:55:14] PROBLEM - HHVM rendering on mw1277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.005 second response time [15:56:23] RECOVERY - HHVM rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 74150 bytes in 0.553 second response time [15:57:02] 10Operations, 10ORES, 10Scoring-platform-team (Current): Give ores admins read access to /srv/log/ores/main.log* - https://phabricator.wikimedia.org/T175736#3604828 (10awight) @Joe Great, thanks for the pointer! I can put together that tail wrapper shortly, but in the meantime I'm working on T169586 which w... 
[16:02:48] (03PS1) 10Ottomata: Increase druid broker query cache size to 2G [puppet] - 10https://gerrit.wikimedia.org/r/377791 [16:03:19] (03CR) 10Ottomata: [C: 032] Increase druid broker query cache size to 2G [puppet] - 10https://gerrit.wikimedia.org/r/377791 (owner: 10Ottomata) [16:08:06] !log restarting druid-brokers with increase in query cache size [16:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:50] (03PS12) 10ArielGlenn: restructure dumps webserver, zim manifests to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/376750 (https://phabricator.wikimedia.org/T175592) [16:14:57] (03PS1) 10Alexandros Kosiaris: chmod uwsgi logs to 0644 [puppet] - 10https://gerrit.wikimedia.org/r/377794 (https://phabricator.wikimedia.org/T175736) [16:15:59] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Give ores admins read access to /srv/log/ores/main.log* - https://phabricator.wikimedia.org/T175736#3604853 (10akosiaris) The above gerrit change should make the main.log files readable by everyone and should fix the issue while a... [16:17:51] (03CR) 10Daniel Kinzler: [C: 031] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368120 (https://phabricator.wikimedia.org/T171807) (owner: 10Smalyshev) [16:19:04] (03PS1) 10Volans: keyholder: add Cumin openstack master key [labs/private] - 10https://gerrit.wikimedia.org/r/377796 (https://phabricator.wikimedia.org/T175712) [16:19:18] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Give ores admins read access to /srv/log/ores/main.log* - https://phabricator.wikimedia.org/T175736#3604877 (10awight) @akosiaris Could you point me to the config that propagates those log messages to logstash? I was looking at e... 
[16:22:43] !log rebooting labtestvirt2002 [16:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:47] 10Operations, 10hardware-requests, 10Discovery-Wikidata-Query-Service-Sprint: decommission wdqs100[12] - https://phabricator.wikimedia.org/T175595#3604884 (10Gehel) [16:25:45] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=wdqs1001.eqiad.wmnet [16:25:52] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=wdqs1002.eqiad.wmnet [16:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:00] (03CR) 10Volans: [V: 032 C: 032] keyholder: add Cumin openstack master key [labs/private] - 10https://gerrit.wikimedia.org/r/377796 (https://phabricator.wikimedia.org/T175712) (owner: 10Volans) [16:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:31] !log starting decommissioning of wdqs100[12] - T175595 [16:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:45] T175595: decommission wdqs100[12] - https://phabricator.wikimedia.org/T175595 [16:30:52] (03CR) 10Volans: "Compiler results available here:" [puppet] - 10https://gerrit.wikimedia.org/r/377787 (https://phabricator.wikimedia.org/T175712) (owner: 10Volans) [16:32:37] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Give ores admins read access to /srv/log/ores/main.log* - https://phabricator.wikimedia.org/T175736#3604901 (10akosiaris) That would be https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/service/m... 
[16:33:32] (03PS2) 10Alexandros Kosiaris: chmod uwsgi logs to 0644 [puppet] - 10https://gerrit.wikimedia.org/r/377794 (https://phabricator.wikimedia.org/T175736) [16:33:57] (03PS1) 10Gehel: wdqs - decommissioning wdqs100[12] [puppet] - 10https://gerrit.wikimedia.org/r/377797 (https://phabricator.wikimedia.org/T175595) [16:34:13] (03CR) 10Elukey: "Summary available in https://phabricator.wikimedia.org/T165519#3604775" [puppet] - 10https://gerrit.wikimedia.org/r/377774 (https://phabricator.wikimedia.org/T165519) (owner: 10Elukey) [16:34:13] 10Operations, 10hardware-requests, 10Discovery-Wikidata-Query-Service-Sprint, 10Patch-For-Review: decommission wdqs100[12] - https://phabricator.wikimedia.org/T175595#3604906 (10Smalyshev) Was ldf server moved from wdqs1001? If not we should move it first thing. [16:34:21] (03CR) 10Awight: [C: 031] "No sensitive data, this should be safe!" [puppet] - 10https://gerrit.wikimedia.org/r/377794 (https://phabricator.wikimedia.org/T175736) (owner: 10Alexandros Kosiaris) [16:35:03] (03CR) 10Volans: "Of course the puppet compiler is only for Prod hosts, I will cherry-pick the change into my own puppetmaster in labs and check the changes" [puppet] - 10https://gerrit.wikimedia.org/r/377787 (https://phabricator.wikimedia.org/T175712) (owner: 10Volans) [16:36:22] 10Operations, 10hardware-requests, 10Discovery-Wikidata-Query-Service-Sprint, 10Patch-For-Review: decommission wdqs100[12] - https://phabricator.wikimedia.org/T175595#3604912 (10Gehel) [16:36:27] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Give ores admins read access to /srv/log/ores/main.log* - https://phabricator.wikimedia.org/T175736#3604913 (10awight) @akosiaris Thanks, I see the messages in logstash now! I'll tweak the ORES dashboard to include them. 
[16:42:44] !log gehel@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=wdqs1002.eqiad.wmnet [16:42:52] !log gehel@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=wdqs1001.eqiad.wmnet [16:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:24] 10Operations, 10hardware-requests, 10Discovery-Wikidata-Query-Service-Sprint, 10Patch-For-Review: decommission wdqs100[12] - https://phabricator.wikimedia.org/T175595#3604956 (10Gehel) @Smalyshev yes, [[ https://github.com/wikimedia/puppet/blob/production/hieradata/role/common/cache/misc.yaml#L137-L139 | l... [16:47:07] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Give ores admins read access to /srv/log/ores/main.log* - https://phabricator.wikimedia.org/T175736#3604957 (10awight) Correct to my last comment—the logs are already in logstash. [16:52:03] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/377797 (https://phabricator.wikimedia.org/T175595) (owner: 10Gehel) [16:53:58] (03CR) 10Gehel: [C: 032] wdqs - decommissioning wdqs100[12] [puppet] - 10https://gerrit.wikimedia.org/r/377797 (https://phabricator.wikimedia.org/T175595) (owner: 10Gehel) [16:55:41] 10Operations, 10hardware-requests, 10Discovery-Wikidata-Query-Service-Sprint, 10Patch-For-Review: decommission wdqs100[12] - https://phabricator.wikimedia.org/T175595#3604978 (10Gehel) a:05Gehel>03RobH @RobH I think my job is done here, let me know if you need anything else from me. [16:56:18] 10Operations, 10hardware-requests, 10Discovery-Wikidata-Query-Service-Sprint, 10Patch-For-Review: decommission wdqs100[12] - https://phabricator.wikimedia.org/T175595#3604983 (10Gehel) [16:56:56] gehel: yep, thx! 
[16:57:09] once it's to the non-interrupt steps, it's easiest to hand off to me [16:57:31] they are documented in case someone else has to do them, but easier to let someone who has switch level access do them [16:57:33] robh: ok, so thanks for taking care of the rest! [17:00:33] RECOVERY - Host labtestvirt2002 is UP: PING WARNING - Packet loss = 61%, RTA = 36.21 ms [17:07:37] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3391572 (10awight) Random tool which seems to give a good estimate of RSS not including copy-on-write pages from the parent process:... [17:20:11] 10Operations, 10ops-eqiad, 10DBA, 10Phabricator: Decommission db1048 (was Move m3 slave to db1059) - https://phabricator.wikimedia.org/T175679#3605096 (10mmodell) @jcrespo Any time will work for me, there is scheduled maintenance at midnight tonight (UTC) but if it's just a few seconds of downtime I think... [17:24:50] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:25:50] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 7.911 second response time [17:27:28] 10Operations, 10Release-Engineering-Team, 10TCB-Team, 10User-Addshore, 10WMDE-QWERTY-Team-Board: Deploy new Wikidiff2 version on beta-cluster - https://phabricator.wikimedia.org/T175818#3605122 (10MaxSem) This should definitely be a new version, 1.5.
[17:34:42] (03CR) 10Thcipriani: "question inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/377304 (owner: 10Chad) [17:40:10] RECOVERY - DPKG on labtestvirt2003 is OK: All packages OK [17:40:20] RECOVERY - nova-compute process on labtestvirt2003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [17:40:21] RECOVERY - dhclient process on labtestvirt2003 is OK: PROCS OK: 0 processes with command name dhclient [17:40:21] RECOVERY - configured eth on labtestvirt2003 is OK: OK - interfaces up [17:40:30] RECOVERY - MD RAID on labtestvirt2003 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [17:40:50] RECOVERY - kvm ssl cert on labtestvirt2003 is OK: Cert /etc/ssl/localcerts/labvirt-star.codfw.wmnet.crt will not expire for at least 30 days. [17:40:50] RECOVERY - Disk space on labtestvirt2003 is OK: DISK OK [17:42:23] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:43:23] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 7.553 second response time [17:43:33] RECOVERY - salt-minion processes on labtestvirt2003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:44:03] RECOVERY - puppet last run on labtestvirt2003 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [17:44:03] RECOVERY - NTP on labtestvirt2003 is OK: NTP OK: Offset -0.09289219975 secs [17:50:33] PROBLEM - Check systemd state on wtp1031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:51:11] (03PS1) 10Zppix: Change logo for huwiktonary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377805 (https://phabricator.wikimedia.org/T175483) [17:53:50] 10Operations, 10ops-eqiad, 10DBA, 10Phabricator: Decommission db1048 (was Move m3 slave to db1059) - https://phabricator.wikimedia.org/T175679#3605223 (10jcrespo) Let's wait a bit more. 
I may have to talk to you abut setting up TLS for php and changing passwords, let's talk and aim for next week (but we sh... [17:53:54] (03PS1) 10Krinkle: webperf: Use minimum sampling of 5 hits/min for by-country breakdown in navtiming [puppet] - 10https://gerrit.wikimedia.org/r/377806 (https://phabricator.wikimedia.org/T166390) [17:54:07] jouncebot: refresh [17:54:09] I refreshed my knowledge about deployments. [17:54:17] (03CR) 10jerkins-bot: [V: 04-1] webperf: Use minimum sampling of 5 hits/min for by-country breakdown in navtiming [puppet] - 10https://gerrit.wikimedia.org/r/377806 (https://phabricator.wikimedia.org/T166390) (owner: 10Krinkle) [17:55:05] !log rolling restart of pdfrender service in equiad after hang T174916 T172815 [17:55:12] (03PS2) 10Zppix: Change logo for huwiktonary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377805 (https://phabricator.wikimedia.org/T175483) [17:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:21] T174916: electron/pdfrender hangs - https://phabricator.wikimedia.org/T174916 [17:55:21] T172815: Improve stability and maintainability of our browser-based PDF render service - https://phabricator.wikimedia.org/T172815 [17:57:53] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3605236 (10GWicke) Given the useful information we have in this task, I am proposing to widen the scope beyond the first job, toward... [18:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170913T1800). Please do the needful. [18:00:05] Framawiki, MatmaRex, Smalyshev, and Zppix: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. 
Please be available during the process. [18:00:06] o/ [18:00:16] here [18:00:53] hello [18:01:21] I can SWAT. [18:02:08] (03PS2) 10Niharika29: Enable Timeless on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377310 (https://phabricator.wikimedia.org/T154371) (owner: 10Framawiki) [18:02:10] (03CR) 10Niharika29: [C: 032] Enable Timeless on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377310 (https://phabricator.wikimedia.org/T154371) (owner: 10Framawiki) [18:02:23] RECOVERY - IPMI Temperature on labtestvirt2003 is OK: Sensor Type(s) Temperature Status: OK [18:02:51] o/ [18:03:17] just at time [18:05:03] (03Merged) 10jenkins-bot: Enable Timeless on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377310 (https://phabricator.wikimedia.org/T154371) (owner: 10Framawiki) [18:06:16] (03CR) 10jenkins-bot: Enable Timeless on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377310 (https://phabricator.wikimedia.org/T154371) (owner: 10Framawiki) [18:08:08] framawiki: Your patch is on mwdebug1002. [18:09:31] (03PS3) 10Niharika29: Add setup for https://www.mediawiki.org/ontology [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368120 (https://phabricator.wikimedia.org/T171807) (owner: 10Smalyshev) [18:09:33] (03CR) 10Niharika29: [C: 032] Add setup for https://www.mediawiki.org/ontology [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368120 (https://phabricator.wikimedia.org/T171807) (owner: 10Smalyshev) [18:09:40] Niharika: it's good for me [18:10:22] framawiki: Okay, going out. 
[18:11:28] (03Merged) 10jenkins-bot: Add setup for https://www.mediawiki.org/ontology [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368120 (https://phabricator.wikimedia.org/T171807) (owner: 10Smalyshev) [18:11:40] Niharika: fyi the only reason the patch that my patch links to is a revert is due to a server-side issue due to scap not a code/patch issue just so you dont worry :P [18:11:40] (03CR) 10jenkins-bot: Add setup for https://www.mediawiki.org/ontology [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368120 (https://phabricator.wikimedia.org/T171807) (owner: 10Smalyshev) [18:12:08] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Enable Timeless on frwiki T154371 (duration: 00m 50s) [18:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:19] T154371: Review and deploy Timeless skin - https://phabricator.wikimedia.org/T154371 [18:15:01] SMalyshev: Your patch is on mwdebug1002. [18:15:15] checking [18:16:11] hmm still getting 404 for https://www.mediawiki.org/ontology [18:17:53] 10Operations, 10Release-Engineering-Team, 10TCB-Team, 10User-Addshore, 10WMDE-QWERTY-Team-Board: Deploy new Wikidiff2 version on beta-cluster - https://phabricator.wikimedia.org/T175818#3605304 (10Legoktm) I tagged `1.5.0`, and prepared packaging changes: * Merge tag '1.5.0' into debian - https://gerrit... [18:18:05] SMalyshev: Some caching issues? [18:18:16] SMalyshev: Oh sorry! [18:18:18] Niharika: does varnish cache 404s? [18:18:28] It's MatmaRex's patch. [18:18:36] Niharika: ah ok :) [18:18:42] MatmaRex: ^^^ [18:18:50] Niharika: yeah, i'm here [18:18:59] Your patch is on mwdebug1002. [18:19:43] SMalyshev: easy to check, hitting an unknown page in an incognito window, then reloading ~10s later shows an Age: header that's increasing. so yes [18:20:03] SMalyshev: Yours is there too now. [18:20:09] Niharika: looks good [18:20:17] ah, ok. but in this case it wasn't that... 
re-checking now [18:20:20] 10Operations, 10fundraising-tech-ops: Enumerate remaining unported stats - https://phabricator.wikimedia.org/T175850#3605313 (10cwdent) [18:20:44] Niharika: yeah, all good now, thanks! [18:23:10] SMalyshev: MatmaRex: Okay, syncing it then. [18:24:26] thanks [18:25:00] !log niharika29@tin Synchronized docroot/mediawiki/ontology/: Add setup for https://www.mediawiki.org/ontology T171807 (duration: 00m 49s) [18:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:14] T171807: Create ontology URL for mediawiki - https://phabricator.wikimedia.org/T171807 [18:25:50] (03PS3) 10Niharika29: Change logo for huwiktonary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377805 (https://phabricator.wikimedia.org/T175483) (owner: 10Zppix) [18:25:53] (03CR) 10Niharika29: [C: 032] Change logo for huwiktonary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377805 (https://phabricator.wikimedia.org/T175483) (owner: 10Zppix) [18:26:30] !log niharika29@tin Synchronized php-1.30.0-wmf.18/extensions/VisualEditor/: Make ve.dm.Change part of core module T175828 (duration: 00m 50s) [18:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:43] T175828: [wmf.18] VisualEditor accidentally requires ES6 syntax support (for editing and visual diffs) - https://phabricator.wikimedia.org/T175828 [18:27:18] !log gerrit2002 - re-enabled puppet - had forgotten to enable it again after recently tryning to test the gerrit->logstash change [18:27:27] thanks Niharika! i can confirm it works in production now [18:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:33] 10Operations, 10Wikimedia-Apache-configuration: URL parameters do not work with pages that have "?" in their names - https://phabricator.wikimedia.org/T123276#1925115 (10Mainframe98) I noticed another related bug while testing this, based on this example >>! 
In T123276#1949228, @BBlack wrote: > https://test.w... [18:28:41] (03Merged) 10jenkins-bot: Change logo for huwiktonary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377805 (https://phabricator.wikimedia.org/T175483) (owner: 10Zppix) [18:28:54] (03CR) 10jenkins-bot: Change logo for huwiktonary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377805 (https://phabricator.wikimedia.org/T175483) (owner: 10Zppix) [18:31:24] 10Operations, 10Beta-Cluster-Infrastructure, 10TCB-Team, 10Release-Engineering-Team (Watching / External), and 2 others: Deploy new Wikidiff2 version on beta-cluster - https://phabricator.wikimedia.org/T175818#3605370 (10greg) [18:31:34] (03CR) 10Dzahn: "oh hah! thanks for the explanation" [puppet] - 10https://gerrit.wikimedia.org/r/377269 (https://phabricator.wikimedia.org/T172333) (owner: 10Alexandros Kosiaris) [18:31:35] Zppix: Your change is on mwdebug1002 too. [18:31:44] Niharika: looking [18:32:30] Niharika: good to push to prod [18:33:58] Zppix: On it. [18:34:11] !log niharika29@tin Synchronized static/images/project-logos/: Change logo for huwiktionary T175483 (duration: 00m 49s) [18:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:27] T175483: Change the logo of huwiktionary - https://phabricator.wikimedia.org/T175483 [18:34:41] Zppix: Done. [18:34:49] ty [18:35:17] SWAT over! 🤘🏼 [18:42:04] 10Operations, 10Fundraising-Backlog, 10fundraising-tech-ops: Enumerate remaining unported stats - https://phabricator.wikimedia.org/T175850#3605447 (10DStrine) [18:46:20] PROBLEM - Check systemd state on wtp1032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:47:23] thanks Niharika ! [18:50:05] (03CR) 10Dzahn: "@Akosiaris upstream pull request here https://github.com/puppetlabs/puppetlabs-stdlib/pull/816/commits" [puppet] - 10https://gerrit.wikimedia.org/r/377355 (owner: 10Dzahn) [18:52:51] urandom: , yt? [19:00:04] No patches in the queue for this window. 
Wheeee! [19:00:40] RECOVERY - DPKG on labmon1001 is OK: All packages OK [19:01:48] ottomata: yeah [19:01:59] ottomata: o/ [19:02:15] heyaaaa! luca and I are looking into prometheus jmx exporter stuff [19:02:27] was wondering: why scap? [19:02:33] ¯\_(ツ)_/¯ [19:02:35] how else? [19:02:36] haha [19:02:40] debian package? [19:03:00] i guess? or even a straight git clone [19:03:09] with scap, we gotta manage list of hosts to deploy to [19:03:16] a straight git clone? [19:03:21] ya in puppet [19:03:22] git::clone [19:03:29] (03CR) 10Chad: [C: 032] group1 to wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377700 (owner: 10Chad) [19:03:33] or, i dunno, the jar isn't TOO big, just put it in puppet :? [19:03:34] haha [19:03:40] maybe folsk would,n't like that [19:03:52] NO [19:03:57] DO NOT PUT JARS IN GIT REPOS [19:04:00] I STAB [19:04:30] :) [19:04:40] ottomata: there you go, as good an answer as any [19:04:43] haha [19:04:49] ottomata: i could see a Debian package here [19:05:13] might be easiest [19:05:14] i might even be tempted upload such a creature [19:05:20] tempted to, that is [19:05:21] would be super simple [19:05:28] ha! [19:05:28] install a single file in /usr/share/java or seomthing [19:05:37] That's the same thing.... [19:05:42] It's a git repo with a jar :p [19:05:49] packaging Java for Debian is never super simple [19:05:54] relatively speaking [19:05:57] yeah, i you were buidling from source [19:06:12] *if*? [19:06:24] if only we have a deployment system with built in support for binary objects--specifically jars at the moment.... [19:06:25] are you? [19:06:34] haha no_justification yes [19:06:36] scap is being used [19:06:37] but [19:06:50] we don't want a deployment system to be responsible for deploying the monitoring for a system [19:06:55] this isn't about deploying an app [19:07:11] ottomata: if i were putting together a Debian package, the source package would build the exporter jar from source [19:07:18] that would be grand! 
[19:07:23] that would be much harder though [19:07:36] especially to not build using internet/maven stuff [19:08:05] if you have to do that and break that rule, you might as well just check in the jar [19:08:32] urandom: another q about this: do you use prometheus for alets? [19:08:34] alerts* [19:08:35] ? [19:08:43] nope [19:08:53] just graphite/icinga? [19:09:03] really not using it for anything atm, just trying to use it [19:09:06] ah [19:09:07] hm [19:09:13] icinga for alerts [19:09:24] yeah, we are considering using it for the new kafka cluster, because we ahve to make changes to the jmxtrans package to run it on stretch [19:09:32] and the jmxtrans version we are using in prod is really old [19:09:42] so, if we want to update, we might as well switch to prometheus now...we think anyway [19:10:27] no_justification: not much difference in checking in jars to repos for scap deployment, as there is checking in all node dependencies in a deploy repo [19:10:48] a dep version change might make one huge commit of dependencies [19:11:06] jars to repos for deployment/deb building* [19:11:11] !log T172384: Upgrading restbase-dev1004 to Cassandra 3.11.0-wmf4 (canary) [19:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:25] T172384: OOM exceptions in dev environment - https://phabricator.wikimedia.org/T172384 [19:11:27] there certainly a difference in checking them into puppet though. i am not really for that. :) [19:11:52] but a repo that explicitly exists to hold some deps / binary stuff, seems ok to me [19:12:09] ottomata: how is that better than using scap? [19:12:20] or rather [19:12:27] my main problem with using scap is maintaining the list of hosts [19:12:30] and doing manual deployments [19:12:34] how is that so much better that it warrants doing something *different*? 
[19:13:13] !log demon@tin Synchronized php: symlink thingie (duration: 00m 49s) [19:13:17] i want to be able to include a class on a node and get metrics from jmx [19:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:42] which, we can now, but you also need to keep your host list in your scap config up to date (you do this with dsh_groups I think?) [19:14:10] and, someone running scap deploy for that repo affects everybody who uses jmx_exporter (ya you can limit, i know) [19:15:20] ottomata: i think a debian package would be reasonably easy for this one [19:15:35] it doesn't really have any dependencies [19:16:22] 10Operations, 10media-storage, 10User-fgiunchedi: Running swiftrepl is not puppetized - https://phabricator.wikimedia.org/T162123#3605612 (10Dzahn) [19:16:25] 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3605611 (10Dzahn) [19:19:08] ottomata: actually, it's only a couple of classes, if push really came to shove, you could almost package it without using the maven build [19:19:21] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 to wmf.18 [19:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:26] !log T172384: Upgrading restbase-dev100[5-6] to Cassandra 3.11.0-wmf4 [19:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:41] T172384: OOM exceptions in dev environment - https://phabricator.wikimedia.org/T172384 [19:22:06] urandom: aye [19:25:19] PROBLEM - puppet last run on restbase-dev1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[cassandra] [19:25:32] bah [19:26:58] PROBLEM - EventBus HTTP Error Rate -4xx + 5xx- on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [10.0] [19:27:19] PROBLEM - puppet last run on restbase-dev1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cassandra] [19:28:02] Pchelolo: Failed validating . None is not of type 'string' [19:28:33] i think \"request_id\": null, again [19:28:42] this time it is a literal null [19:28:45] ottomata: fixed by https://gerrit.wikimedia.org/r/#/c/376574/ [19:28:46] instead of quoated string [19:28:48] it's not deployed yet [19:28:54] ah! [19:28:55] ok [19:29:02] ok so we should ignore? [19:29:05] for now? [19:29:12] (should be fixed at least) [19:29:51] k [19:31:49] ACKNOWLEDGEMENT - puppet last run on restbase-dev1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cassandra] eevans Testing new Cassandra build [19:31:49] ACKNOWLEDGEMENT - puppet last run on restbase-dev1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 8 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cassandra] eevans Testing new Cassandra build [19:34:21] !log ppchelko@tin Started deploy [changeprop/deploy@86e8103]: Separate kafka metrics by rule [19:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:18] ACKNOWLEDGEMENT - puppet last run on restbase-dev1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[cassandra] eevans Testing new Cassandra build [19:35:40] !log ppchelko@tin Finished deploy [changeprop/deploy@86e8103]: Separate kafka metrics by rule (duration: 01m 18s) [19:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:43] !log rebooting labtestvirt2001 [19:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:46] RECOVERY - Check systemd state on restbase1010 is OK: OK - running: The system is fully operational [19:56:12] 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3605721 (10Dzahn) See patch above, based on the cumin results and feedback from the first few users, in the first round i suggest the following to be whitelisted: - package bu... [20:00:04] No patches in the queue for this window. Wheeee! [20:07:24] urandom: no gerrit bot, but i just merged your restbase-dev-only change [20:07:40] dev_cluster should use the new dev version [20:09:22] robh: wanna kick wikibugs (https://tools.wmflabs.org/?tool=wikibugs) [20:09:34] i dont have the right access to do it properly [20:09:35] doc is at https://www.mediawiki.org/wiki/Wikibugs#Help [20:09:44] i was granted access but never got to where i could use it reliably [20:10:00] i can try to track someone down in about 10 minutes, middle of important procurement task [20:10:09] legoktm: ^ maybe , if you dont mind [20:10:15] robh: no worries, thanks [20:16:10] 10Operations, 10Fundraising-Backlog, 10fundraising-tech-ops, 10Spike: Spike: Enumerate remaining unported stats - https://phabricator.wikimedia.org/T175850#3605804 (10DStrine) [20:22:36] RECOVERY - puppet last run on restbase-dev1005 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [20:24:26] RECOVERY - puppet last run on restbase-dev1004 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [20:26:41] I 
see "480 Undefined index: requestId in /srv/mediawiki/php-1.30.0-wmf.18/extensions/EventBus/JobQueueEventBus.php on line 42" floating in fatalmonitor. Just an FYI. [20:26:47] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 26 probes of 287 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [20:27:02] 10Operations, 10Fundraising-Backlog, 10fundraising-tech-ops, 10Spike: Spike: Enumerate remaining unported stats - https://phabricator.wikimedia.org/T175850#3605833 (10DStrine) a:05cwdent>03None [20:31:46] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 287 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [20:34:21] !log deployed https://gerrit.wikimedia.org/r/#/c/377845/ (no gerrit bot, jenkins lint failing, etc) due to possible codfw routing issues [20:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:56] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 37 probes of 287 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [20:39:29] 10Operations, 10Continuous-Integration-Infrastructure, 10DNS, 10Traffic: CI: operations-dns-lint broken due to missing Maxmind DB file - https://phabricator.wikimedia.org/T175864#3605898 (10Dzahn) [20:39:47] 10Operations, 10Continuous-Integration-Infrastructure, 10DNS, 10Traffic: CI: operations-dns-lint broken due to missing Maxmind DB file - https://phabricator.wikimedia.org/T175864#3605902 (10Dzahn) [20:40:32] 10Operations, 10ops-eqdfw, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#3605904 (10Cmjohnson) Swapped the motherboard, the error still presented during installation. It's looking more like something else oth... 
[20:40:49] 10Operations, 10Continuous-Integration-Infrastructure, 10DNS, 10Traffic: CI: operations-dns-lint broken due to missing Maxmind DB file - https://phabricator.wikimedia.org/T175864#3605883 (10Dzahn) [20:43:04] !log re-routed ulsfo->eqiad around possible codfw damage as well - https://gerrit.wikimedia.org/r/#/c/377871/ [20:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:09] (although in theory that shouldn't be affected, as its on our own links, but just to be paranoid) [20:48:56] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 6 probes of 287 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [20:51:37] Niharika: I think that might be fixed by https://gerrit.wikimedia.org/r/377840 [20:51:38] not sure though [20:59:46] (03PS5) 10Dzahn: PHAB: deployment scripts to be called by scap [puppet] - 10https://gerrit.wikimedia.org/r/370622 (https://phabricator.wikimedia.org/T172847) (owner: 1020after4) [21:00:04] No patches in the queue for this window. Wheeee! [21:00:56] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 34 probes of 287 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [21:00:59] puck you jouncebot [21:01:39] bblack, are we stable enough to deploy changes/run maint scripts [21:01:41] ? [21:05:36] grrr [21:05:40] <3 IRC [21:05:56] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 17 probes of 287 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [21:11:06] okayyy, I'll assume if ops aren't there it's not an outage [21:14:11] musikanimal, yt? [21:14:35] I am! [21:14:40] we ready? :) [21:16:02] Dereckson: I think I fixed wiktionary, but can't tell because it's otherwise broken for me anyway. [21:16:08] (Sorry, got distracted.) [21:16:20] Also I'm totally not supposed to be up right now. 
[21:16:33] !log populating ip_changes on group1 wikis [21:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:18] yay [21:22:10] (03PS13) 10ArielGlenn: restructure dumps webserver, zim manifests to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/376750 (https://phabricator.wikimedia.org/T175592) [21:23:22] mutante: thanks! [21:30:26] MaxSem|grrrr: what wiki are we on? [21:30:41] musikanimal, cawiki [21:31:26] k [21:31:58] 783K ip edits [21:33:37] I think Commons will be the biggest with 2.9 ip edits [21:34:07] *2.9 million [21:39:06] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 43 probes of 287 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [21:40:06] (03CR) 10MaxSem: [C: 032] Configure ACW destination [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377874 (owner: 10MaxSem) [21:41:40] (03Merged) 10jenkins-bot: Configure ACW destination [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377874 (owner: 10MaxSem) [21:41:54] (03CR) 10jenkins-bot: Configure ACW destination [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377874 (owner: 10MaxSem) [21:43:51] kaldari, pulled on mwdebug1002 [21:49:06] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 2 probes of 287 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [21:49:55] (03PS4) 10Dzahn: icinga: initial whitelist for screen monitoring [puppet] - 10https://gerrit.wikimedia.org/r/377823 (https://phabricator.wikimedia.org/T165348) [21:55:40] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/377874/ (duration: 00m 47s) [21:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:13] kaldari, [21:56:21] ^^^ [21:56:45] ergh urandom i got a prelim debian/ going for this [21:56:53] buuut something strange is stopping me [21:57:08] i've patched the pom.xml to use archiva to pull in 
dependencies [21:57:12] (03PS14) 10ArielGlenn: restructure dumps webserver, zim manifests to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/376750 (https://phabricator.wikimedia.org/T175592) [21:57:18] and then am packaging with mvn package in override_dh_auto_build [21:57:28] during build though, it can't reach any networking [21:57:40] on copper buildserver, pbuilder drops me into a fakeroot shell [21:57:48] and i can mvn package (from archiva) just fine there [21:58:00] but, during the dpkg-buildpackage process, it doesn't work [21:58:08] (03PS5) 10Dzahn: icinga: initial whitelist for screen monitoring [puppet] - 10https://gerrit.wikimedia.org/r/377823 (https://phabricator.wikimedia.org/T165348) [22:01:03] MaxSem|grrrr: the rate at which the ip changes script is updating is slowing down [22:01:12] not sure if that's cause for alarm [22:01:25] just means it's gunna take a while!! [22:02:17] it's taking around 4 secs to do each batch of 200 [22:03:58] !log maxsem@tin Synchronized php-1.30.0-wmf.18/extensions/ArticleCreationWorkflow/: https://gerrit.wikimedia.org/r/#/c/377900/1 (duration: 00m 49s) [22:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:18] kaldari, Niharika ^^^ [22:04:45] I _joe_ Can I deploy for the jobqueue now? https://phabricator.wikimedia.org/T175316 [22:04:58] or should I wait for SWAT? 
[22:05:09] I'm deploying it right now [22:05:56] MaxSem|grrrr: oh, there wasn't anything in the calendar [22:06:02] sorry, let me know when you're done [22:06:17] !log maxsem@tin Synchronized php-1.30.0-wmf.18/extensions/EventBus/: https://gerrit.wikimedia.org/r/#/c/377900/1 (duration: 00m 49s) [22:06:22] Amir1, ^ [22:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:47] MaxSem|grrrr: thanks but that's not the one I want to deploy [22:06:56] It's the patch in Wikidata extension [22:07:01] ah [22:07:17] Pchelolo, I've deployed your patch [22:07:44] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3606149 (10Verdy_p) So it solves a part of the probl... [22:07:44] but both they are related to the jobqueue [22:08:15] okay, let me start [22:08:47] _joe_ Pchelolo gwicke CC [22:09:04] <_joe_> Amir1: ? [22:09:14] musikanimal: I wonder how much quicker it would be doing less inserts, but inserting more rows at a time [22:09:32] _joe_: regarding this which is related to the whole jobqueue problem [22:09:32] https://phabricator.wikimedia.org/T175316 [22:09:58] (03PS6) 10Dzahn: icinga: initial whitelist for screen monitoring [puppet] - 10https://gerrit.wikimedia.org/r/377823 (https://phabricator.wikimedia.org/T165348) [22:10:03] _joe_: I want to deploy the hotfix if that's okay. Or I wait for the SWAT (one hour from now) [22:10:29] <_joe_> Amir1: it can wait until SWAT, but isn't it like the middle of the night for you? 
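Note on the 22:01 to 22:09 exchange above (batches of 200 taking about 4 seconds each, and the suggestion to do fewer inserts with more rows per insert): the idea can be sketched roughly as follows. This is only an illustration, not the actual populateIpChanges.php script, which is PHP running against the production database. The table `ip_changes_demo`, its columns, and the helper `insert_in_batches` are hypothetical names, and SQLite is used purely so the sketch runs anywhere.

```python
import sqlite3


def insert_in_batches(conn, rows, batch_size):
    """Insert rows with one multi-row INSERT per batch.

    Returns the number of INSERT statements issued, to make the
    "fewer statements, more rows each" trade-off visible.
    """
    statements = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # One "(?, ?)" group per row, so the whole batch is a single statement.
        placeholders = ", ".join(["(?, ?)"] * len(batch))
        params = [value for row in batch for value in row]
        conn.execute(
            "INSERT INTO ip_changes_demo (rev_id, ip_hex) VALUES " + placeholders,
            params,
        )
        statements += 1
    conn.commit()
    return statements


# Hypothetical demo table and data; the real schema is not shown in the log.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ip_changes_demo (rev_id INTEGER PRIMARY KEY, ip_hex TEXT)")
rows = [(i, format(i, "08X")) for i in range(1, 1001)]

statements = insert_in_batches(conn, rows, batch_size=200)
total = conn.execute("SELECT COUNT(*) FROM ip_changes_demo").fetchone()[0]
print(statements, total)
```

Whether this is actually faster on a replicated production database depends on things the log does not say (replication lag waits, transaction size limits), so treat the batch size as a tunable rather than a recommendation.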
[22:10:38] <_joe_> it can actually wait EU SWAT tomorrow [22:10:43] ping greg-g ^ [22:10:48] <_joe_> it's important but not urgent [22:10:58] yeah, that's the reason I wanted to deploy sooner [22:11:51] _joe_: I can wait until the evening SWAT (one hour) [22:12:38] <_joe_> Amir1: I am ok whatever is your choice [22:12:52] (03PS2) 10Ebe123: Set $wgScoreSafeMode to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374446 (https://phabricator.wikimedia.org/T174413) [22:12:57] <_joe_> I was suggesting the EU swat as you'll have time to follow the effects and revert in case of issues [22:13:26] I will stay here to monitor [22:13:46] I'm usually around this time, only not drunk tonight :D [22:14:53] thank you MaxSem|grrrr [22:16:11] Added to the evening SWAT [22:21:06] (03PS4) 10Andrew Bogott: WIP: nova: turn off hourly instance usage audits [puppet] - 10https://gerrit.wikimedia.org/r/377187 [22:21:08] (03PS1) 10Andrew Bogott: bootstrap-vz: prevent puppet-agent from starting [puppet] - 10https://gerrit.wikimedia.org/r/377907 [22:21:50] (03CR) 10Andrew Bogott: [C: 032] bootstrap-vz: prevent puppet-agent from starting [puppet] - 10https://gerrit.wikimedia.org/r/377907 (owner: 10Andrew Bogott) [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170913T2300). [23:00:04] Pchelolo, Ebe123, Amir1, and mooeypoo: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:13] o/ [23:00:17] Here [23:00:29] Hello. I can SWAT this evening. 
[23:00:46] Mine is already deployed [23:01:12] (PS2) Dzahn: Add mobile subdomains for all arbcom wikis [dns] - https://gerrit.wikimedia.org/r/376904 (https://phabricator.wikimedia.org/T175447) (owner: Reedy) [23:01:45] (PS1) Thcipriani: Deploy discovery-analytics with scap3 [puppet] - https://gerrit.wikimedia.org/r/377916 (https://phabricator.wikimedia.org/T129149) [23:02:04] (CR) Dzahn: [C: 2] Add mobile subdomains for all arbcom wikis [dns] - https://gerrit.wikimedia.org/r/376904 (https://phabricator.wikimedia.org/T175447) (owner: Reedy) [23:02:34] mine is not testable at all [23:02:38] jobqueue [23:02:50] (PS3) Dzahn: add wikimania2019 [dns] - https://gerrit.wikimedia.org/r/377680 [23:03:04] Have tests ready for mine [23:04:46] Can someone stand in for mooeypoo to test an Echo change? (https://gerrit.wikimedia.org/r/#/c/377910/) [23:04:58] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to researchers group (stat1003 and MySQL) for niharika29 - https://phabricator.wikimedia.org/T159780#3606339 (Niharika) Resolved>Open I appear to have lost my access to stat1003. Can I get it back? I use it for looking at ev... [23:05:19] Operations, Commons, MediaWiki-extensions-Scribunto, Patch-For-Review, Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3606341 (zhuyifei1999) Open>Resolved This... [23:05:19] I'm here [23:05:36] Dereckson, sorry, I'm here. I had a small issue typing, but I'm here. 
[23:05:37] ok :) [23:08:04] jouncebot: now [23:08:05] For the next 0 hour(s) and 51 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170913T2300) [23:08:07] jouncebot: next [23:08:07] In 0 hour(s) and 51 minute(s): Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170914T0000) [23:08:24] Ebe123: so you're a current maintainer of Score extension? [23:09:18] I guess so. I'm fixing old bugs and adding new features to an otherwise unmaintained project [23:09:27] * Dereckson nods [23:09:36] who's doing SWAT? [23:09:39] musikanimal: I'm [23:09:48] can we squeeze in https://gerrit.wikimedia.org/r/#/c/377908/ ? [23:10:18] Ebe123: is there some performance impact to disable this safe mode? Same question for security. [23:11:36] musikanimal: I guess so, can you cherry-pick it to the versions you want? [23:11:47] musikanimal: https://tools.wmflabs.org/versions/ [23:11:50] For security, we have enabled the use of Firejail for the commands used. The safe mode has disabled lots of useful functionality though [23:12:27] Shouldn't have any performance impact [23:12:55] 10Operations, 10ops-eqiad, 10DC-Ops: scs-c1-eqiad unresponsive - https://phabricator.wikimedia.org/T175625#3606372 (10RobH) I didn't have the serial to provide to Sales during the chat via their website, and instead had to submit a support request via their site. I got no email confirmation of the request f... [23:13:11] Dereckson: I think that's already done, and group3 gets wmf.18 tomorrow right? 
[23:13:12] and I see dpatrick and Reedy followed https://phabricator.wikimedia.org/T171372, okay perfect [23:13:19] musikanimal: right [23:13:40] (03CR) 10Dzahn: [C: 032] add wikimania2019 [dns] - 10https://gerrit.wikimedia.org/r/377680 (owner: 10Dzahn) [23:14:24] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374446 (https://phabricator.wikimedia.org/T174413) (owner: 10Ebe123) [23:16:12] just added it to the deployment calendar [23:16:45] dammit put it in the wrong place [23:16:59] (03Merged) 10jenkins-bot: Set $wgScoreSafeMode to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374446 (https://phabricator.wikimedia.org/T174413) (owner: 10Ebe123) [23:17:06] (03CR) 10jenkins-bot: Set $wgScoreSafeMode to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374446 (https://phabricator.wikimedia.org/T174413) (owner: 10Ebe123) [23:18:05] alrighty [23:18:40] musikanimal: Your forgot your name. :P [23:18:51] Ebe123: live on mwdebug1002 [23:18:57] 10Operations, 10DC-Ops: document all scs connections - https://phabricator.wikimedia.org/T175876#3606383 (10RobH) [23:19:09] oh put it in the wrong spot whoops [23:19:18] mwdebug1002 is where? [23:19:24] Pchelolo: could you add a '''ALREADY DONE.''' note on the Deployments page please? [23:19:32] Includes places like enwiki? [23:19:53] Yes, it's one of the application server, it serves absolutely all the regular projects [23:19:53] Dereckson: sure [23:19:56] https://wikitech.wikimedia.org/wiki/Debugging_in_production [23:20:45] You only have to add an header to your request to ask the request to go to this server. 
This is automated through an extension, https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug [23:21:53] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to researchers group (stat1003 and MySQL) for niharika29 - https://phabricator.wikimedia.org/T159780#3606400 (10RobH) 05Open>03Resolved a:03RobH @Niharika: Stat1003 is being decommissioned. Please see @Ottomata's email to the... [23:22:02] Pchelolo: thanks [23:22:36] Ebe123: so you install the extension, in the list you pick mwdebug1002, you put the trigger ton on, and you can test :) [23:22:39] It works [23:23:06] I can now close many bugs! [23:23:48] :) [23:24:33] Amir1: live on mwdebug1002 [23:24:38] topic branch: insecticide [23:24:47] mooeypoo: live on mwdebug1002 too [23:24:49] Dereckson: not testable, related to jobqueue :) [23:24:54] ack'ed [23:25:12] checking [23:25:45] w00t w00t [23:25:49] Dereckson, works! [23:25:51] \o/ [23:25:57] PROBLEM - EventBus HTTP Error Rate -4xx + 5xx- on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [10.0] [23:26:15] mooeypoo: ack'ed [23:26:22] Ebe123: syncing to prod [23:26:56] !log dereckson@tin Synchronized wmf-config/CommonSettings.php: Set $wgScoreSafeMode to false (T174413) (duration: 00m 50s) [23:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:15] T174413: Set $wgScoreSafeMode to false - https://phabricator.wikimedia.org/T174413 [23:28:00] !log dereckson@tin Synchronized php-1.30.0-wmf.18/extensions/Echo/modules/api/mw.echo.api.APIHandler.js: (no justification provided) (duration: 00m 48s) [23:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:02] Amir1: syncing [23:29:06] PROBLEM - EventBus HTTP Error Rate -4xx + 5xx- on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [10.0] [23:29:16] Thanks. 
Monitoring logstash and grafana [23:29:52] I'm looking into that ^^ [23:29:57] !log dereckson@tin Synchronized php-1.30.0-wmf.18/extensions/Wikidata/extensions/Wikibase/client: Split page set before constructing InjectRCRecordsJob (T175316) (duration: 00m 57s) [23:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:13] T175316: Very large jobs posted by Wikidata - https://phabricator.wikimedia.org/T175316 [23:30:23] musikanimal: so? [23:30:45] do you need a cherry-pick or no? [23:31:23] Dereckson: it's cherry picked [23:31:51] (03PS6) 10Dzahn: PHAB: deployment scripts to be called by scap [puppet] - 10https://gerrit.wikimedia.org/r/370622 (https://phabricator.wikimedia.org/T172847) (owner: 1020after4) [23:33:40] musikanimal: okay, so to make that clear on the deployment table, you can use the target branch as tag (here [wmf18]) and link to the cherry-pick change (here 377918) [23:34:46] musikanimal: you can test the script on terbium if I pull it there? [23:35:36] I see, okay updating table now [23:35:51] I might have to ask help from MaxSem about testing on terbium [23:36:09] (03CR) 10Dzahn: [C: 04-1] "PS6 - amended to use systemctl instead of service command" [puppet] - 10https://gerrit.wikimedia.org/r/370622 (https://phabricator.wikimedia.org/T172847) (owner: 1020after4) [23:36:09] but this is just an update to a maintenance script btw, not sure if that matters [23:36:30] ?? [23:36:39] (03CR) 10Dzahn: [C: 031] PHAB: deployment scripts to be called by scap [puppet] - 10https://gerrit.wikimedia.org/r/370622 (https://phabricator.wikimedia.org/T172847) (owner: 1020after4) [23:37:03] so right now I'm running an old version of that script [23:37:24] yeah, we were gonna do SWAT to have the new one ready for tomorrow [23:37:25] and would rather avoid stopping it [23:37:26] musikanimal: we're still waiting on Jenkins jobs — https://integration.wikimedia.org/zuul/ [23:37:55] MaxSem: ok [23:37:59] Thank you! 
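Note on the 23:19 to 23:22 explanation above (a request is routed to mwdebug1002 by adding a header, normally set for you by the X-Wikimedia-Debug browser extension): a minimal sketch of setting such a header by hand. The header name comes from the wikitech link in the log; the value format `backend=<host>`, the full hostname, and the helper `build_debug_request` are assumptions for illustration, so check the wikitech page before relying on them. The request is only constructed here, never sent.

```python
import urllib.request


def build_debug_request(url, backend):
    """Build (but do not send) a request carrying the debug routing header.

    The "backend=<host>" value format is an assumption, not taken from the log.
    """
    request = urllib.request.Request(url)
    request.add_header("X-Wikimedia-Debug", "backend=" + backend)
    return request


req = build_debug_request(
    "https://en.wikipedia.org/wiki/Special:Version",
    "mwdebug1002.eqiad.wmnet",  # assumed fully qualified name of mwdebug1002
)

# urllib normalises the stored header name to "X-wikimedia-debug";
# HTTP header names are case-insensitive, so this is harmless on the wire.
print(req.header_items())
```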
[23:38:05] You're welcome Ebe123 :) [23:38:06] Dereckson, just sync it and we can test it later. this won't break the live site [23:38:20] Ebe123: Thanks for taking care of the Lilypond layer :) [23:38:33] Or should I rather say "merci" :) [23:39:02] You're welcome [23:39:59] ohhhh [23:40:15] so after you cherry pick there's a new patch, jeez, sorry [23:41:05] yep [23:41:10] one per branch [23:41:44] that makes sense [23:41:47] to cherry-pick means "I take one commit and I put it as the new HEAD of the target branch" [23:42:21] it's a new commit, as it doesn't have the same history [23:42:39] right I know what cherry picking is, but we used the cherry pick button in the first patch and I thought maybe that was it [23:43:17] makes sense now [23:43:21] :) [23:43:26] (CR) 20after4: [C: 1] "Can we merge this before tonight's deployment?" [puppet] - https://gerrit.wikimedia.org/r/370622 (https://phabricator.wikimedia.org/T172847) (owner: 20after4) [23:44:11] (actually we've two commits here: 94060acfb07755998c946a66e3d6b011dcfccffa and a merge commit) [23:44:27] right right [23:44:53] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to researchers group (stat1003 and MySQL) for niharika29 - https://phabricator.wikimedia.org/T159780#3606446 (Niharika) @RobH Thanks. I'm not on the Ops mailing list. Maybe Wikitech-l is a better venue for changes as drastic as this... [23:48:00] !log dereckson@tin Synchronized php-1.30.0-wmf.18/maintenance/populateIpChanges.php: Improve ip_changes update script to insert rows in batches ([[Gerrit:377918]]) (duration: 00m 49s) [23:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:22] Many many thanks! [23:48:40] 23:47:49 Check 'Logstash Error rate for mw1263.eqiad.wmnet' failed: ERROR: 50% OVER_THRESHOLD (Avg. 
Error rate: Before: 0.03, After: 2.00, Threshold: 1.00) [23:49:41] (almost certainly unrelated with this change) [23:52:12] Okay, I have monitored enough, everything looks fine, jobqueue, error rate, dispatch, lag, fatal monitor [23:52:24] o/ [23:53:14] thanks for checking Amir1 [23:54:43] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to researchers group (stat1003 and MySQL) for niharika29 - https://phabricator.wikimedia.org/T159780#3606465 (RobH) It is recommended (but not required) that all folks with shell access be part of the ops list. (Not disagreeing wit... [23:56:00] (CR) 20after4: [C: 1] PHAB: deployment scripts to be called by scap (1 comment) [puppet] - https://gerrit.wikimedia.org/r/370622 (https://phabricator.wikimedia.org/T172847) (owner: 20after4) [23:56:24] (CR) 20after4: [C: 1] Phabricator: Redirect all http traffic to https [puppet] - https://gerrit.wikimedia.org/r/354247 (https://phabricator.wikimedia.org/T165643) (owner: Paladox) [23:57:02] (CR) 20after4: "@thcipriani: I will look it over once more - this was a first draft I may indeed have missed some things" [puppet] - https://gerrit.wikimedia.org/r/374054 (https://phabricator.wikimedia.org/T172847) (owner: 20after4) [23:57:46] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to researchers group (stat1003 and MySQL) for niharika29 - https://phabricator.wikimedia.org/T159780#3078358 (MaxSem) Also analytics: https://lists.wikimedia.org/pipermail/analytics/2017-August/006004.html [23:58:42] (PS4) 20after4: Move phabricator conf files outside of source tree [puppet] - https://gerrit.wikimedia.org/r/374054 (https://phabricator.wikimedia.org/T172847)
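Note on the 23:40 to 23:42 cherry-pick exchange ("it's a new commit, as it doesn't have the same history"): a toy model of why the same change gets a different identifier on each branch. This is not git's real object format. The function `commit_id` and all the sample values are hypothetical; the only point is that a commit identifier covers the parent as well as the change, so applying an identical diff on top of a different HEAD produces a different commit.

```python
import hashlib


def commit_id(parent_id, author, message, diff):
    """Toy commit identifier: a hash over the parent plus the change itself.

    Real git hashes a structured commit object (tree, parents, timestamps);
    this simplified version keeps only what the illustration needs.
    """
    payload = "\n".join([parent_id, author, message, diff]).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


# Two branches with different tips (placeholder content).
master_head = commit_id("", "someone", "Earlier change on master", "diff A")
branch_head = commit_id("", "someone", "Branch cut", "diff B")

# The same author, message and diff, applied on top of each tip.
original = commit_id(master_head, "author", "Insert rows in batches", "diff C")
cherry_pick = commit_id(branch_head, "author", "Insert rows in batches", "diff C")

print(original != cherry_pick)
```

This matches the "one per branch" remark in the log: each target branch ends up with its own commit and therefore its own change to review and merge.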