[00:06:02] (03CR) 10Gergő Tisza: [C: 031] logging: Collect mw1017 logs for debugging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269069 (https://phabricator.wikimedia.org/T117020) (owner: 10BryanDavis) [00:12:16] (03CR) 10Gergő Tisza: [C: 031] logging: Send all udp2log eligible messages to $wmgDefaultMonologHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269068 (https://phabricator.wikimedia.org/T117019) (owner: 10BryanDavis) [01:17:15] 10Ops-Access-Requests, 6operations, 10DBA, 5Patch-For-Review: Grant mysql client access to testreduce_vd and testreduce_0715 databases - https://phabricator.wikimedia.org/T125435#2007040 (10jcrespo) If it is not puppetized, it does not exist. [01:30:58] 6operations, 10Traffic, 10Wikimedia-Apache-configuration, 5Patch-For-Review: wikiversity.org and wikinews.org redirects to /503.html - https://phabricator.wikimedia.org/T109226#2007064 (10TTO) Is this fixed now? The links in the task description seem to work correctly. [01:59:11] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000000.0] [02:00:14] 6operations, 10Traffic, 10Wikimedia-Apache-configuration, 5Patch-For-Review: wikiversity.org and wikinews.org redirects to /503.html - https://phabricator.wikimedia.org/T109226#2007088 (10Krenair) Sounds like the bug only appears when HHVM is broken and a request to wikiversity.org returns a 503 and gets c... 
[02:13:40] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [02:20:50] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0] [02:29:51] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [5000000.0] [02:31:50] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [02:44:11] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0] [02:57:01] PROBLEM - puppet last run on ms-fe3001 is CRITICAL: CRITICAL: Puppet has 1 failures [03:23:59] RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:25:11] 6operations, 10Traffic, 10Wikimedia-Apache-configuration, 5Patch-For-Review: wikiversity.org and wikinews.org redirects to /503.html - https://phabricator.wikimedia.org/T109226#2007137 (10BBlack) @Krenair - basically, the problem is that our Apache configuration has some implementation bugs with how 301 re... 
[05:37:20] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [24.0] [05:40:50] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [06:23:29] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [24.0] [06:27:26] (03PS1) 10Ori.livneh: varnish: report response age to StatsD [puppet] - 10https://gerrit.wikimedia.org/r/269086 [06:30:20] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:39] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [06:30:40] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:50] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: puppet fail [06:30:59] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:00] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:30] PROBLEM - puppet last run on mw1251 is CRITICAL: CRITICAL: Puppet has 1 failures [06:57:09] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:57:20] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:30] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:58:39] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:58:50] RECOVERY - puppet last run on mw1251 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:19] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:38:17] (03PS3) 10EBernhardson: Better mediawiki REPL [puppet] - 
10https://gerrit.wikimedia.org/r/268541 [08:25:11] !log rebooting es2001 to es2004 for kernel update [08:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:42:47] <_joe_> !log trying a manual run of l10nupdate since it failed last night again [08:42:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:07:01] !log oblivian@tin sync-l10n completed (1.27.0-wmf.12) (duration: 11m 55s) [09:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:12:54] !log hhvm restarted on mw1034.eqiad.wmnet due to hhvm package update [09:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:13:50] 6operations, 10ops-codfw: es2004 doesn't come back up after reboot - https://phabricator.wikimedia.org/T126203#2007324 (10MoritzMuehlenhoff) 3NEW a:3Papaul [09:15:11] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Feb 8 09:15:11 UTC 2016 (duration 8m 10s) [09:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:15:25] !log hhvm restarted on mw1044.eqiad.wmnet due to hhvm package update [09:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:16:48] ACKNOWLEDGEMENT - Host es2004 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff stuck after reboot, T126203 [09:24:10] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 65.22% of data above the critical threshold [5000000.0] [09:25:45] (03Abandoned) 10Addshore: Add wikidata.org high edit count monitoring [puppet] - 10https://gerrit.wikimedia.org/r/268662 (owner: 10Addshore) [09:29:13] !log rebooting es2005,es2007,es2009,es2010 for kernel update [09:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:34:57] (03PS1) 10Hashar: package_builder: set HOOKDIR only when it exists [puppet] - 
10https://gerrit.wikimedia.org/r/269095 (https://phabricator.wikimedia.org/T125999) [09:38:59] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0] [09:40:21] (03CR) 10Hashar: "That is to help for T125999 . I have cherry picked the patch on integration puppet master and ran puppet on the slave that runs the debian" [puppet] - 10https://gerrit.wikimedia.org/r/269095 (https://phabricator.wikimedia.org/T125999) (owner: 10Hashar) [09:43:59] 6operations, 10ops-codfw: es2004 doesn't come back up after reboot - https://phabricator.wikimedia.org/T126203#2007368 (10jcrespo) I made it boot, it only got stuck asking for F1 to continue. The actual error I think it is memory related, but BIOS messages get confusing over serial console + SSH: ``` Phoenix... [09:44:35] PROBLEM - mysqld processes on es2004 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [09:45:12] !log starting es2004 [09:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:46:35] RECOVERY - mysqld processes on es2004 is OK: PROCS OK: 1 process with command name mysqld [09:55:18] (03CR) 10Filippo Giunchedi: "I'm not 100% it is a typo since it would make sense anyway to put git clean in the background, Chase?" 
[puppet] - 10https://gerrit.wikimedia.org/r/268341 (owner: 1020after4) [09:59:41] (03CR) 10Filippo Giunchedi: [C: 031] swift: move roles to modules/role/ [puppet] - 10https://gerrit.wikimedia.org/r/260610 (owner: 10Dzahn) [10:06:05] (03CR) 10Filippo Giunchedi: [C: 031] Apertium: Move logs to /srv/log [puppet] - 10https://gerrit.wikimedia.org/r/268852 (https://phabricator.wikimedia.org/T107900) (owner: 10Mobrovac) [10:07:28] (03CR) 10Filippo Giunchedi: [C: 031] Apertium: Fix --log-path position in SystemD unit file [puppet] - 10https://gerrit.wikimedia.org/r/268856 (owner: 10Mobrovac) [10:11:24] (03PS2) 10Jcrespo: Enforce SSL on change master [software] - 10https://gerrit.wikimedia.org/r/258167 [10:11:35] (03CR) 10Filippo Giunchedi: [C: 04-1] Zotero: Move logs to /srv/log (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/268847 (https://phabricator.wikimedia.org/T107900) (owner: 10Mobrovac) [10:11:44] (03CR) 10Jcrespo: [C: 032 V: 032] Enforce SSL on change master [software] - 10https://gerrit.wikimedia.org/r/258167 (owner: 10Jcrespo) [10:19:46] 6operations, 7Monitoring: [RFC] Alert about *when* partitions will run out of space, not a percentage/absolute number - https://phabricator.wikimedia.org/T126158#2007423 (10fgiunchedi) +1 would be good to have this [10:20:17] !log changing s2 replication topology in preparation for master failover [10:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:23:12] 6operations, 7Availability, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: swiftrepl replication pass for thumbnails eqiad -> codfw - https://phabricator.wikimedia.org/T125791#2007424 (10fgiunchedi) similar distribution after ~150k requests: ``` $ sort ~/thumbs_requests | sort | uniq -c | sort -nr | head -5... 
[10:25:29] !log upgrading jobrunners/imagescalers in eqiad for hhvm float timeout fix [10:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:26:58] !log rebooting es2006,es2008 for kernel update [10:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:32:53] 6operations, 10Traffic: Upgrade to Varnish 4: things to remember - https://phabricator.wikimedia.org/T126206#2007434 (10ema) 3NEW a:3ema [10:34:38] (03CR) 10Hashar: "It is probably of no use, a passing build had:" [puppet] - 10https://gerrit.wikimedia.org/r/269095 (https://phabricator.wikimedia.org/T125999) (owner: 10Hashar) [10:35:09] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 810 [10:37:49] (03CR) 10Filippo Giunchedi: "LGTM! it feels like this could be a generally useful command to have (also to non-root if we chmod a+x /var/lib/puppet) and signal success" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/268684 (owner: 10Ema) [10:40:09] RECOVERY - check_mysql on db1008 is OK: Uptime: 1710114 Threads: 2 Questions: 9352728 Slow queries: 11540 Opens: 4120 Flush tables: 2 Open tables: 421 Queries per second avg: 5.469 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:42:02] 6operations, 10Traffic: Upgrade to Varnish 4: things to remember - https://phabricator.wikimedia.org/T126206#2007468 (10ori) [ ] Make sure that [[ https://github.com/wikimedia/operations-puppet/blob/24cc170e/modules/varnish/files/varnishlog.py | `modules/varnish/files/varnishlog.py` ]] (and the metric loggers... [10:45:28] (03PS1) 10Gehel: Ship Elasticsearch logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101) [10:47:35] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "We already have a motd line telling us that "puppet ran N minutes ago". Wouldn't it be better to merge the two?" 
[puppet] - 10https://gerrit.wikimedia.org/r/268684 (owner: 10Ema) [11:00:05] moritzm: Respected human, time to deploy Terbium maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160208T1100). Please do the needful. [11:00:30] !log rebooting terbium for kernel update [11:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:04:54] (03PS1) 10Hashar: contint: set pbuilder basepath to actual directory [puppet] - 10https://gerrit.wikimedia.org/r/269103 (https://phabricator.wikimedia.org/T125999) [11:05:37] !log rebooting db2012 for kernel update [11:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:13:24] (03CR) 10Hashar: "That is not the issue actually :-D" [puppet] - 10https://gerrit.wikimedia.org/r/269103 (https://phabricator.wikimedia.org/T125999) (owner: 10Hashar) [11:25:35] !log starting mysql at db2012 [11:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:26:37] (03CR) 10Ema: "Agreed, we can install puppet-enabled as a separate script and use it in 97-last-puppet-run instead of having an additional motd line. Tha" [puppet] - 10https://gerrit.wikimedia.org/r/268684 (owner: 10Ema) [11:26:52] !log stopping mysql at db2012 [11:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:27:04] 6operations, 10Traffic, 10Wikimedia-Apache-configuration, 5Patch-For-Review: Sometimes apache error 503s redirect to /503.html and get cached - https://phabricator.wikimedia.org/T109226#2007520 (10Krenair) [11:27:23] 6operations, 10Traffic, 10Wikimedia-Apache-configuration, 5Patch-For-Review: Sometimes apache error 503s redirect to /503.html and this redirect gets cached - https://phabricator.wikimedia.org/T109226#1543344 (10Krenair) [11:36:07] (03PS12) 10Elukey: Adding a new email template for Burrow lag alerts. 
[puppet] - 10https://gerrit.wikimedia.org/r/268682 (https://phabricator.wikimedia.org/T126008) [11:36:28] 6operations, 10DBA: Reimage db2012 - https://phabricator.wikimedia.org/T126209#2007543 (10Krenair) [11:39:14] 6operations, 10DBA: Reimage db2012 - https://phabricator.wikimedia.org/T126209#2007550 (10jcrespo) [11:41:03] is there somone else exept lego and hoo to help with big global rename? They are always off when im on :/ [11:41:17] :S [11:41:23] that lego, always slacking [11:42:42] Steinsplitter: sorry, I really was just about to sleep :( [11:43:14] (03PS2) 10BBlack: VCL: do not use illegal "trusted" XFF values for XCIP [puppet] - 10https://gerrit.wikimedia.org/r/266486 (https://phabricator.wikimedia.org/T120121) [11:44:26] (03CR) 10BBlack: [C: 032 V: 032] VCL: do not use illegal "trusted" XFF values for XCIP [puppet] - 10https://gerrit.wikimedia.org/r/266486 (https://phabricator.wikimedia.org/T120121) (owner: 10BBlack) [11:53:10] !log rebooting cp1074, cp3047 (for kernels, also to compare bios/drac settings...) 
[11:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:05:15] (03PS3) 10Yurik: Set CSP to false [puppet] - 10https://gerrit.wikimedia.org/r/268677 [12:05:16] <_joe_> !log restarted cron on tin, to catch up with the uid change for the l10nupdate user [12:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:06:20] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp1074_v4, cp1074_v6 [12:06:29] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp1074_v4, cp1074_v6 [12:06:29] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp1074_v4, cp1074_v6 [12:06:39] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp1074_v4, cp1074_v6 [12:07:00] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp1074_v4, cp1074_v6 [12:07:21] PROBLEM - IPsec on cp4015 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp1074_v4, cp1074_v6 [12:07:21] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp1074_v4, cp1074_v6 [12:07:21] PROBLEM - IPsec on cp4006 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp1074_v4, cp1074_v6 [12:07:21] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp1074_v4, cp1074_v6 [12:07:21] PROBLEM - IPsec on cp4007 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp1074_v4, cp1074_v6 [12:07:21] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp1074_v4, cp1074_v6 [12:07:22] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp1074_v4, cp1074_v6 [12:07:26] (03PS2) 10Ema: Display a message in motd if puppet agent is disabled [puppet] - 10https://gerrit.wikimedia.org/r/268684 [12:07:30] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp1074_v4, cp1074_v6 [12:07:39] PROBLEM - IPsec on cp3049 is CRITICAL: 
Strongswan CRITICAL - ok: 36 not-conn: cp1074_v4, cp1074_v6 [12:07:39] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp1074_v4, cp1074_v6 [12:07:40] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp1074_v4, cp1074_v6 [12:07:49] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp1074_v4, cp1074_v6 [12:07:49] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp1074_v4, cp1074_v6 [12:07:49] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp1074_v4, cp1074_v6 [12:07:50] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp1074_v4, cp1074_v6 [12:07:50] PROBLEM - IPsec on cp3042 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp1074_v4, cp1074_v6 [12:07:50] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp1074_v4, cp1074_v6 [12:07:50] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp1074_v4, cp1074_v6 [12:07:51] PROBLEM - IPsec on cp3033 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp1074_v4, cp1074_v6 [12:07:51] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp1074_v4, cp1074_v6 [12:07:52] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp1074_v4, cp1074_v6 [12:07:52] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp1074_v4, cp1074_v6 [12:07:53] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp1074_v4, cp1074_v6 [12:08:40] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3047_v4, cp3047_v6 [12:08:40] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 162 not-conn: cp3047_v4, cp3047_v6 [12:08:40] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3047_v4, cp3047_v6 [12:08:50] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan 
CRITICAL - ok: 162 not-conn: cp3047_v4, cp3047_v6 [12:08:51] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3047_v4, cp3047_v6 [12:08:59] <_joe_> we could make this check intelligent [12:09:06] <_joe_> and detect if a server is depooled [12:09:20] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3047_v4, cp3047_v6 [12:09:20] PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3047_v4, cp3047_v6 [12:09:21] PROBLEM - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3047_v4, cp3047_v6 [12:09:30] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3047_v4, cp3047_v6 [12:09:30] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3047_v4, cp3047_v6 [12:09:30] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3047_v4, cp3047_v6 [12:09:40] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 162 not-conn: cp3047_v4, cp3047_v6 [12:09:49] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3047_v4, cp3047_v6 [12:09:50] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 162 not-conn: cp3047_v4, cp3047_v6 [12:09:50] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 162 not-conn: cp3047_v4, cp3047_v6 [12:09:59] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 162 not-conn: cp3047_v4, cp3047_v6 [12:10:00] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3047_v4, cp3047_v6 [12:10:00] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3047_v4, cp3047_v6 [12:14:55] (03PS13) 10Elukey: Adding a new email template for Burrow lag alerts. 
[puppet] - 10https://gerrit.wikimedia.org/r/268682 [12:15:09] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 58 ESP OK [12:15:10] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 58 ESP OK [12:15:10] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 58 ESP OK [12:15:15] _joe_: eh, it's tricky. ideally we want that check to be sensitive so long as the host is booted and configured... [12:15:19] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 38 ESP OK [12:15:19] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 38 ESP OK [12:15:20] RECOVERY - IPsec on cp3042 is OK: Strongswan OK - 38 ESP OK [12:15:20] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 38 ESP OK [12:15:20] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 38 ESP OK [12:15:20] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 38 ESP OK [12:15:20] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 38 ESP OK [12:15:21] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 164 ESP OK [12:15:21] RECOVERY - IPsec on cp3033 is OK: Strongswan OK - 38 ESP OK [12:15:22] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 38 ESP OK [12:15:22] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 38 ESP OK [12:15:23] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 38 ESP OK [12:15:23] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 38 ESP OK [12:15:24] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 58 ESP OK [12:15:29] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 164 ESP OK [12:15:29] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 164 ESP OK [12:15:30] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 164 ESP OK [12:15:31] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 58 ESP OK [12:15:31] since it is a security thing [12:15:34] <_joe_> yep [12:15:39] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 58 ESP OK [12:15:41] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 38 ESP OK [12:15:42] but yeah it sucks presently [12:15:49] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 38 ESP OK 
[12:15:49] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 38 ESP OK [12:15:59] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 38 ESP OK [12:16:09] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 58 ESP OK [12:16:10] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 58 ESP OK [12:16:10] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 164 ESP OK [12:16:19] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 38 ESP OK [12:16:19] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 164 ESP OK [12:16:20] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 58 ESP OK [12:16:41] RECOVERY - IPsec on cp4015 is OK: Strongswan OK - 38 ESP OK [12:16:41] RECOVERY - IPsec on cp4013 is OK: Strongswan OK - 38 ESP OK [12:16:41] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 38 ESP OK [12:16:41] RECOVERY - IPsec on cp4006 is OK: Strongswan OK - 38 ESP OK [12:16:41] RECOVERY - IPsec on cp4007 is OK: Strongswan OK - 38 ESP OK [12:16:41] RECOVERY - IPsec on cp4014 is OK: Strongswan OK - 38 ESP OK [12:16:41] RECOVERY - IPsec on cp4005 is OK: Strongswan OK - 38 ESP OK [12:16:49] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 58 ESP OK [12:16:49] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 38 ESP OK [12:16:49] RECOVERY - IPsec on cp1061 is OK: Strongswan OK - 58 ESP OK [12:16:50] RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 58 ESP OK [12:17:00] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 38 ESP OK [12:17:00] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 38 ESP OK [12:17:00] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 38 ESP OK [12:20:26] (03CR) 10Aude: Use custom generator for mobile search on Wikibase (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254645 (https://phabricator.wikimedia.org/T85368) (owner: 10Bene) [12:27:31] (03CR) 10JanZerebecki: [C: 031] Use custom generator for mobile search on Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254645 (https://phabricator.wikimedia.org/T85368) (owner: 10Bene) 
[12:31:22] (03CR) 10Aude: Use custom generator for mobile search on Wikibase (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254645 (https://phabricator.wikimedia.org/T85368) (owner: 10Bene) [12:32:25] <_joe_> !log uploaded a new pybal package; installing on codfw and ulsfo backups [12:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:32:30] !log restarting hhvm on mw1052, mw1075, mw1080, mw1081, mw1094, mw1095 to roll out the new version [12:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:32:39] <_joe_> elukey: is it needed? [12:32:53] <_joe_> elukey: apt-get restarts hhvm when you install the package [12:33:58] _joe_: nope! [12:34:35] I missed this part, thought the opposite (shame on me) [12:35:45] ah yes, I should read apt's output more carefully [12:36:05] got it, thanks! [12:40:45] (03PS4) 10Aude: Use custom generator for mobile search on Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254645 (https://phabricator.wikimedia.org/T85368) (owner: 10Bene) [12:41:31] 6operations, 10DBA, 5Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#2007630 (10jcrespo) ``` differences in schema in db1018.eqiad.wmnet ********************************** ======================================================================... [12:41:38] (03CR) 10Alexandros Kosiaris: [C: 031] "@Filippo, I doubt that. It is not really a process that consumes enough time to make it worthy to parallelize. 
Plus it is not clearly docu" [puppet] - 10https://gerrit.wikimedia.org/r/268341 (owner: 1020after4) [12:44:21] (03CR) 10JanZerebecki: Use custom generator for mobile search on Wikibase (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254645 (https://phabricator.wikimedia.org/T85368) (owner: 10Bene) [12:45:26] (03PS5) 10Aude: Use custom generator for mobile search on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254645 (https://phabricator.wikimedia.org/T85368) (owner: 10Bene) [12:46:47] (03PS6) 10Aude: Use custom generator for mobile search on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254645 (https://phabricator.wikimedia.org/T85368) (owner: 10Bene) [12:48:21] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - OK - All pools are healthy [12:49:30] (03CR) 10JanZerebecki: [C: 031] Use custom generator for mobile search on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254645 (https://phabricator.wikimedia.org/T85368) (owner: 10Bene) [12:50:11] PROBLEM - PyBal backends health check on lvs2004 is CRITICAL: PYBAL CRITICAL - OK - All pools are healthy [12:50:40] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [12:50:49] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [12:53:51] PROBLEM - PyBal backends health check on lvs2005 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Internal Server Error [12:56:13] (03CR) 10ArielGlenn: "I think we can make a decision here about what we want. 
Do we want the git checkout to proceed only if the git clean is successful (double &" [puppet] - 10https://gerrit.wikimedia.org/r/268341 (owner: 1020after4) [12:56:55] !log updated hhvm on mw1080, mv1084, mw1241 [12:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:01:59] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:02:01] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:03:20] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [13:05:11] RECOVERY - PyBal backends health check on lvs2005 is OK: PYBAL OK - All pools are healthy [13:05:19] RECOVERY - PyBal backends health check on lvs2004 is OK: PYBAL OK - All pools are healthy [13:05:23] <_joe_> !log roll back installation of pybal, issues with upd and ipv6 [13:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:06:08] (03PS1) 10KartikMistry: lttoolbox: New upstream version [debs/contenttranslation/lttoolbox] - 10https://gerrit.wikimedia.org/r/269115 (https://phabricator.wikimedia.org/T124137) [13:09:11] !log updated hhvm on mw2016.codfw.wmnet, mw2161.codfw.wmnet, mw2199.codfw.wmnet, mw1259.eqiad.wmnet, mw1260.eqiad.wmnet [13:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:12:41] !log start up more rolling cache reboots for kernels (cpNNNN) [13:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:14:12] 6operations, 6Project-Creators: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#2007678 (10Aklapper) Proposal looks good to me so feel free to go ahead. (Not sure "what the rest" means here, if there are specific points please bring them up and I'm happy to commen... 
[13:17:18] (03CR) 10Filippo Giunchedi: [C: 031] "yeah @akosiaris you are indeed right, it wouldn't make sense to background!" [puppet] - 10https://gerrit.wikimedia.org/r/268341 (owner: 1020after4) [13:37:50] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 162 not-conn: cp3037_v4, cp3037_v6 [13:38:00] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 161 no-child-sa: cp3004_v4 not-conn: cp3037_v4, cp3037_v6 [13:38:32] ^ just me [13:38:36] probably will be a few more [13:40:49] PROBLEM - Host cp3037 is DOWN: PING CRITICAL - Packet loss = 100% [13:41:40] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 164 ESP OK [13:42:20] RECOVERY - Host cp3037 is UP: PING OK - Packet loss = 0%, RTA = 86.06 ms [13:43:11] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 164 ESP OK [13:56:56] !log cpNNNN rolling reboots paused (3038 still coming up) [13:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:59:35] (03PS2) 10Rush: Add missing & (typo?) [puppet] - 10https://gerrit.wikimedia.org/r/268341 (owner: 1020after4) [14:00:00] 6operations, 7JavaScript: Instability on fr.wikiversity project - https://phabricator.wikimedia.org/T112069#2007747 (10Aklapper) [14:00:23] (03PS3) 10Faidon Liambotis: git: add missing & (typo?) 
to exec [puppet] - 10https://gerrit.wikimedia.org/r/268341 (owner: 1020after4) [14:01:33] (03PS2) 10BBlack: cache_mobile LVS decom: 1/2 remove LVS service [puppet] - 10https://gerrit.wikimedia.org/r/268226 (https://phabricator.wikimedia.org/T109286) [14:01:49] (03CR) 10Rush: [C: 032 V: 032] "It's a typo for sure" [puppet] - 10https://gerrit.wikimedia.org/r/268341 (owner: 1020after4) [14:01:58] (03PS2) 10Muehlenhoff: Add base::firewall to jobrunners mw1161-mw1169 (reprovisioned app servers) [puppet] - 10https://gerrit.wikimedia.org/r/267238 [14:02:30] (03PS3) 10BBlack: cache_mobile LVS decom: 1/2 remove LVS service [puppet] - 10https://gerrit.wikimedia.org/r/268226 (https://phabricator.wikimedia.org/T109286) [14:02:54] (03CR) 10Rush: "why make the change?" [puppet] - 10https://gerrit.wikimedia.org/r/268360 (https://phabricator.wikimedia.org/T121861) (owner: 10Filippo Giunchedi) [14:03:24] (03CR) 10BBlack: [C: 032 V: 032] cache_mobile LVS decom: 1/2 remove LVS service [puppet] - 10https://gerrit.wikimedia.org/r/268226 (https://phabricator.wikimedia.org/T109286) (owner: 10BBlack) [14:03:51] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add base::firewall to jobrunners mw1161-mw1169 (reprovisioned app servers) [puppet] - 10https://gerrit.wikimedia.org/r/267238 (owner: 10Muehlenhoff) [14:03:56] !log starting mobile LVS service decom (IPs moving to text) - puppet disabled on text caches and high-traffic1 LVSes [14:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:06:18] (03PS3) 10Muehlenhoff: Add base::firewall to jobrunners mw1161-mw1169 (reprovisioned app servers) [puppet] - 10https://gerrit.wikimedia.org/r/267238 [14:06:29] (03CR) 10Muehlenhoff: [V: 032] Add base::firewall to jobrunners mw1161-mw1169 (reprovisioned app servers) [puppet] - 10https://gerrit.wikimedia.org/r/267238 (owner: 10Muehlenhoff) [14:09:40] (03CR) 10Rush: [C: 031] "as I understand it this is the right idea, but I have not tested the config." 
[puppet] - 10https://gerrit.wikimedia.org/r/268851 (https://phabricator.wikimedia.org/T118176) (owner: 10Dzahn) [14:17:24] (03PS3) 10BBlack: cache_mobile decom: 1/2 remove realserver IPs [puppet] - 10https://gerrit.wikimedia.org/r/268228 (https://phabricator.wikimedia.org/T109286) [14:17:42] (03CR) 10BBlack: [C: 032 V: 032] cache_mobile decom: 1/2 remove realserver IPs [puppet] - 10https://gerrit.wikimedia.org/r/268228 (https://phabricator.wikimedia.org/T109286) (owner: 10BBlack) [14:23:30] PROBLEM - puppet last run on cp1046 is CRITICAL: CRITICAL: puppet fail [14:25:20] RECOVERY - puppet last run on cp1046 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [14:28:54] (03Abandoned) 10Muehlenhoff: Enable ferm on mw1161-mw1169 [puppet] - 10https://gerrit.wikimedia.org/r/260340 (owner: 10Muehlenhoff) [14:39:44] (03PS2) 10BBlack: cache_mobile LVS decom: 2/3 remove conftool node data [puppet] - 10https://gerrit.wikimedia.org/r/268227 (https://phabricator.wikimedia.org/T109286) [14:39:46] (03PS1) 10BBlack: cache_mobile LVS decom: 3/3 remove conftool service data [puppet] - 10https://gerrit.wikimedia.org/r/269127 (https://phabricator.wikimedia.org/T109286) [14:41:04] !log mobile LVS service decom complete (IPs now belong to text service) [14:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:41:40] (03CR) 10BBlack: [C: 032] cache_mobile LVS decom: 2/3 remove conftool node data [puppet] - 10https://gerrit.wikimedia.org/r/268227 (https://phabricator.wikimedia.org/T109286) (owner: 10BBlack) [14:46:40] (03CR) 10BBlack: [C: 032] cache_mobile LVS decom: 3/3 remove conftool service data [puppet] - 10https://gerrit.wikimedia.org/r/269127 (https://phabricator.wikimedia.org/T109286) (owner: 10BBlack) [14:50:16] 6operations, 10RESTBase-Cassandra: impact of large sstables on cassandra - https://phabricator.wikimedia.org/T126221#2007841 (10fgiunchedi) 3NEW [14:50:54] 6operations, 10Beta-Cluster-Infrastructure, 
6Services, 5Patch-For-Review: Move Node.JS services to Jessie and Node 4.2 - https://phabricator.wikimedia.org/T124989#2007854 (10hashar) For the sources repositories under mediawiki/services/.* you should now be able to comment `check experimental` to trigger `... [14:57:35] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#2007868 (10mark) >>! In T125126#1995337, @EBernhardson wrote: > The varnishes, having previously served the entirety of mobile traffic, are likely plenty to handle any inc... [14:59:10] (03PS3) 10BBlack: cache_mobile decom: 2/2 Remove most cache config [puppet] - 10https://gerrit.wikimedia.org/r/268229 (https://phabricator.wikimedia.org/T109286) [15:01:43] !log restarting and upgrading dbstore1001 (db backups agent host) [15:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:02:20] (03PS2) 10Elukey: dhcp: switch mc1004/1005 to jessie installer [puppet] - 10https://gerrit.wikimedia.org/r/268311 (https://phabricator.wikimedia.org/T123711) (owner: 10Dzahn) [15:03:25] (03CR) 10Elukey: [C: 032] dhcp: switch mc1004/1005 to jessie installer [puppet] - 10https://gerrit.wikimedia.org/r/268311 (https://phabricator.wikimedia.org/T123711) (owner: 10Dzahn) [15:10:20] (03PS4) 10BBlack: cache_mobile decom: 2/2 Remove most cache config [puppet] - 10https://gerrit.wikimedia.org/r/268229 (https://phabricator.wikimedia.org/T109286) [15:10:31] (03CR) 10BBlack: [C: 032 V: 032] cache_mobile decom: 2/2 Remove most cache config [puppet] - 10https://gerrit.wikimedia.org/r/268229 (https://phabricator.wikimedia.org/T109286) (owner: 10BBlack) [15:12:42] (03PS1) 10Jcrespo: Enable ferm for dbstore databases on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/269135 [15:12:51] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 323 bytes in 0.022 second response time [15:12:52] PROBLEM - Auth DNS for 
labs pdns on labs-ns3.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [15:14:05] ^dns issues in labs...again [15:14:27] <_joe_> chasemp: are you on it? [15:14:42] 6operations, 10EventBus, 6Services, 10hardware-requests: 4 more Kafka brokers, 2 in eqiad and 2 codfw - https://phabricator.wikimedia.org/T124469#2007906 (10Ottomata) Ah! No, this isn't really about eventbus. It is, sorta but it isn't. It is more about planning future cross DC use of the main Kafka cl... [15:14:44] !restarting pdns and pdns-recursor on labservices1001 [15:14:45] RECOVERY - Auth DNS for labs pdns on labs-ns3.wikimedia.org is OK: DNS OK: 7.934 seconds response time. nagiostest.eqiad.wmflabs returns [15:14:55] _joe_: yeah assuming a simple restart works atm, I'll talk w/ andrew more about it [15:14:57] later [15:15:11] this is happening consistently, even similar times [15:16:43] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 781277 bytes in 4.552 second response time [15:17:11] both the auth server and the recursor? [15:17:49] I *think* it's the auth server that stops working, and anything else is secondary fallout [15:18:01] (the auth server that answers auth queries by querying LDAP...) [15:18:28] 6operations, 10EventBus, 6Services, 10hardware-requests: 4 more Kafka brokers, 2 in eqiad and 2 codfw - https://phabricator.wikimedia.org/T124469#2007913 (10Ottomata) > will all logical topics have several Kafka topics, prefixed / post-fixed with DC name, as discussed in T123954? A simpler answer: No. U... [15:22:36] chasemp, bblack, this is https://phabricator.wikimedia.org/T124680 [15:22:41] not that there’s anything of use there :( [15:22:50] (03CR) 10Ottomata: Increase length of lag window to 100 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/268594 (https://phabricator.wikimedia.org/T125916) (owner: 10Nuria) [15:23:28] have you looked into the pdns metrics yet? 
[15:24:08] also is it just one or multiple auth servers going down at the same time? [15:24:08] no [15:24:46] mark: I think it’s just one, but it’s hard to tell since the server logs themselves don’t report any issues. [15:25:09] have you increased debug/log level? [15:26:41] it’s at 6 now. I haven’t dug through the logs for today yet, though, doing that now. [15:26:52] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: puppet fail [15:26:56] I’ll have a look this morning about moving the logs to a dedicated file and increasing the level yet more. [15:27:28] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#2007923 (10BBlack) [15:27:41] 6operations, 10Traffic, 6Zero: Use Text IP for Mobile hostnames to gain SPDY/H2 coalesce between the two - https://phabricator.wikimedia.org/T124482#2007927 (10BBlack) [15:27:43] 6operations, 6Discovery, 10Maps, 10Traffic, 5Patch-For-Review: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162#2007928 (10BBlack) [15:27:46] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#2007924 (10BBlack) 5Open>3Resolved [15:28:22] checking out netmon, probably related to my cache_mobile commits... [15:28:59] yeah torrus, I guess I can't ignore it :) [15:29:00] what happens at 15:00 utc, I wonder? Or at 5 minutes to? [15:29:58] I missed the log... 
too early [15:30:03] !log restarting pdns and pdns-recursor on labservices1001 [15:30:04] !log stopping redis and memcached for mc1004.eqiad.wmnet due to Jessie re-image [15:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:31:48] (03PS1) 10BBlack: torrus: remove cache_mobile stuff [puppet] - 10https://gerrit.wikimedia.org/r/269141 (https://phabricator.wikimedia.org/T109286) [15:32:13] chasemp: 0 queries are falling over to holmium. So that might be something work looking at, on the client side. [15:32:24] But in the meantime I guess we can discount holmium from the investigation [15:32:25] (03CR) 10BBlack: [C: 032 V: 032] torrus: remove cache_mobile stuff [puppet] - 10https://gerrit.wikimedia.org/r/269141 (https://phabricator.wikimedia.org/T109286) (owner: 10BBlack) [15:33:23] (03PS16) 10Thcipriani: Puppet provider for scap3 [puppet] - 10https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: 10Alexandros Kosiaris) [15:33:25] (03PS1) 10Thcipriani: Add scap3 deployment option for services [puppet] - 10https://gerrit.wikimedia.org/r/269143 [15:37:14] (03CR) 10Thcipriani: [C: 04-1] "WIP" [puppet] - 10https://gerrit.wikimedia.org/r/269143 (owner: 10Thcipriani) [15:38:27] 6operations, 10ops-eqiad, 10Traffic: eqiad cache cluster re-arrangements - https://phabricator.wikimedia.org/T125486#2007975 (10BBlack) [15:39:53] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:41:30] 6operations, 10hardware-requests: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#2007979 (10fgiunchedi) agreed the decommission seem tight but doable, also free disk space on nodes at any given time can be hard to predict (details in T126221) re: LVM, w... 
[15:42:32] 6operations, 10ops-codfw, 10procurement: codfw- db2012- 2x600GB SAS drives purchase request - https://phabricator.wikimedia.org/T126226#2007989 (10Papaul) 3NEW a:3RobH [15:44:28] 6operations, 10ops-codfw: db2012 degraded RAID - https://phabricator.wikimedia.org/T124645#2007997 (10Papaul) Chris mentioned that he doesn't have those drives on-site so I am making another task to order drives for this system. see:T126226 [15:46:02] (03CR) 10Ottomata: Add scap3 deployment option for services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/269143 (owner: 10Thcipriani) [15:49:27] 6operations, 10ops-eqiad: dbstore1001 management interface has saturated the number of available ssh connections - https://phabricator.wikimedia.org/T126227#2008004 (10jcrespo) 3NEW [15:50:15] chasemp: how sure are you that the outage was still happening when you restarted pdns, and how sure are you that that fixed it? [15:50:56] there is a definite race condition of assurance where I see it and then assume my restart has an effect [15:51:45] I keep wondering if we’re seeing a network outage rather than a dns outage [15:51:53] no evidence of that, really [15:53:06] a network outage that affects labs maybe, esp the tools-home host I could see, but the general labs-ns3 check is from neon to labservices1001 right? 
[15:53:16] no other evidence of issues in that path at all I think [15:53:45] ok, have a look at https://ganglia.wikimedia.org/latest/?r=2hr&cs=&ce=&m=cpu_report&c=Miscellaneous+eqiad&h=labservices1001.wikimedia.org&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=NOGROUPS [15:53:49] and click on pdns metrics [15:54:56] I'm realizing I thought you meant queries to the host failing and not from the host to ldap fyi [15:55:07] (03PS1) 10Krinkle: cache: Normalise hostname for /w/skins,resources,extensions [puppet] - 10https://gerrit.wikimedia.org/r/269149 (https://phabricator.wikimedia.org/T99096) [15:55:31] (03CR) 10jenkins-bot: [V: 04-1] cache: Normalise hostname for /w/skins,resources,extensions [puppet] - 10https://gerrit.wikimedia.org/r/269149 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [15:55:40] andrewbogott: so is something hammering pdns at that time? https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20eqiad&h=labservices1001.wikimedia.org&r=2hr&z=default&jr=&js=&st=1454946831&event=hide&ts=0&v=2&m=pdns_concurrent-queries&vl=bytes&z=large [15:55:46] chasemp: looks like [15:56:08] although if pdns was backed up for some reason (e.g. slow sql response) it might look the same? 
[15:56:25] oh, no it wouldn’t, not the ‘queries’ graph [15:56:37] (03CR) 10Filippo Giunchedi: "mostly no reason to use statsd afaik, also we could use less statsd traffic in general" [puppet] - 10https://gerrit.wikimedia.org/r/268360 (https://phabricator.wikimedia.org/T121861) (owner: 10Filippo Giunchedi) [15:56:55] jynus: says the db side is responsive and I tend to believe him and he also indicated a rash of activity but I believe no db level load issues [15:57:05] 6operations, 10ops-codfw, 10procurement: codfw- db2012and db2019 5x600GB SAS drives purchase request - https://phabricator.wikimedia.org/T126226#2008038 (10Papaul) [15:57:05] oops trying to quote not communicate jynus :) [15:57:23] well, I cannot discard it 100% [15:57:31] 6operations, 10Deployment-Systems, 6Performance-Team, 10Traffic, 5Patch-For-Review: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#2008045 (10Krinkle) [15:57:48] andrewbogott: is there a way to straight up log every query? [15:57:53] I just do not see an issue from server side, like less activity, connection loss, etc. [15:58:09] chasemp: there is. [15:58:22] right now the logs are going into syslog — I can’t decide if that’s a problem or not. [15:58:24] monitoring usually catches that, but maybe not if the period is too short [15:58:39] chasemp: but yeah, one strategy is to turn that on and just wait until it happens again... [15:59:17] 6operations, 10ops-codfw: db2019 has a failed disk - https://phabricator.wikimedia.org/T120073#2008051 (10jcrespo) [15:59:19] 6operations, 10ops-codfw, 10procurement: codfw- db2012and db2019 5x600GB SAS drives purchase request - https://phabricator.wikimedia.org/T126226#2008052 (10jcrespo) [15:59:44] considering that spike I want to see who is doing it and I'm suspicious of a common culprit across occurrences [16:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. 
Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160208T1600). [16:00:04] Addshore aude Dereckson: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:16] chasemp: yep, I’ll have a patch shortly [16:00:52] * aude here [16:01:10] I can SWAT. addshore Dereckson ping! [16:01:13] Hi [16:01:15] !log restarting pdns on labservices 1001 to test loglevels [16:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:01:24] 6operations, 10ops-codfw: es2010 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T117848#2008067 (10jcrespo) [16:01:43] chasemp: have a look at syslog on labservices1001 now [16:01:47] thcipriani: here! [16:01:57] should be enough info, you think? If /var doesn’t fill up :) [16:02:03] andrewbogott: although a client side abuser...how would that relate to https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20eqiad&h=labservices1001.wikimedia.org&r=2hr&z=default&jr=&js=&st=1454947244&event=hide&ts=0&v=93849&m=pdns_cache-entries&vl=bytes&z=large [16:02:12] (03PS4) 10Thcipriani: wgRCWatchCategoryMembership true everywhere except wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264734 (owner: 10Addshore) [16:02:16] i.e. why is cache entries dropping out? 
[16:02:41] and it may happen periodically https://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&c=Miscellaneous+eqiad&h=labservices1001.wikimedia.org&jr=&js=&event=hide&ts=0&v=93849&m=pdns_cache-entries&vl=bytes [16:02:42] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264734 (owner: 10Addshore) [16:03:24] (03Merged) 10jenkins-bot: wgRCWatchCategoryMembership true everywhere except wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264734 (owner: 10Addshore) [16:03:26] chasemp: is that metric the # of entries in the cache, or the # of hits? [16:04:03] here is a separate hits I think so it's clearing cache entirely maybe? [16:04:12] lot of interesting activity, esp https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20eqiad&h=labservices1001.wikimedia.org&r=2hr&z=default&jr=&js=&st=1454947244&event=hide&ts=0&v=0.233333333333&m=pdns_outgoing-timeouts&vl=qps&z=large [16:04:20] which I take to mean timeouts to ldap [16:04:52] this server isn’t using ldap [16:05:53] (03PS1) 10Andrew Bogott: Turn pdns loglevels WAY UP [puppet] - 10https://gerrit.wikimedia.org/r/269152 (https://phabricator.wikimedia.org/T124680) [16:05:58] (03PS2) 10Dzahn: Apertium: Fix --log-path position in SystemD unit file [puppet] - 10https://gerrit.wikimedia.org/r/268856 (owner: 10Mobrovac) [16:06:02] I think that metric just means “I’m swamped and have started dropping queries" [16:07:24] oh that's right... i added pdns metrics to ganglia a long time ago [16:07:43] (03PS17) 10Krinkle: Implement /w/static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) [16:07:47] mark: yeah! Very handy :) [16:08:36] jynus: looks like a lot of ongoing database things happening in fatalmonitor, is it fine that I'm SWATting right now? 
[16:08:42] jynus: https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor [16:09:23] chasemp: if the cache is actively being cleared, it’s not regular enough to be caused by e.g. a cron or something [16:09:45] checking [16:10:03] 6operations, 10Security-Reviews, 7Surveys: Re-evaluate Limesurvey - https://phabricator.wikimedia.org/T109606#2008134 (10chasemp) a:5chasemp>3None I was reflecting the conclusions above me more than decision making. >>! In T109606#1946136, @egalvezwmf wrote: > Hey @chasemp - can you offer some informati... [16:10:05] the metrics are described in the pdns docs, if you hadn't found them already [16:10:43] for some reason that wasn't showing on https://logstash.wikimedia.org/#/dashboard/elasticsearch/wfLogDBError [16:11:17] mark, I’m looking but haven’t found them yet [16:11:50] * ebernhardson needs to put together a no-highlight script for logstash urls that include elasticsearch :P [16:12:07] me too :) [16:12:53] I cannot see the problem, those are rpc calls, but I see no lag on my monitoring [16:13:11] jynus: max concurrency? [16:13:30] bbl [16:13:57] if you are talking production, no, that is mediawiki what it is on read-only mode [16:14:45] I think it is due to the topology [16:15:04] there is no outage, but 1 out of 2 api servers is failing [16:15:11] let me try something [16:15:28] chasemp: "shows the number of entries in the cache” you’re right [16:15:32] (mediawiki should not assume that api server's master is the real master) [16:15:32] so the cache is suddenly emptying [16:16:58] jynus: if everything seems ok on your end (aside from the api server) is SWAT ok to continue? [16:17:39] <_joe_> jynus: need help? 
[16:18:08] not a blocker for deployment, ortogonal, keep on [16:18:14] (03PS2) 10Krinkle: cache: Normalise hostname for /w/skins,resources,extensions [puppet] - 10https://gerrit.wikimedia.org/r/269149 (https://phabricator.wikimedia.org/T99096) [16:18:16] jynus: thank you [16:18:34] (03PS1) 10Muehlenhoff: Unique slapo-unique to ensure uniqueness of gidNumber for groups [puppet] - 10https://gerrit.wikimedia.org/r/269155 [16:18:44] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, we can still bikeshed on the naming though!" [puppet] - 10https://gerrit.wikimedia.org/r/268684 (owner: 10Ema) [16:18:50] (03CR) 10Dzahn: [C: 032] Apertium: Fix --log-path position in SystemD unit file [puppet] - 10https://gerrit.wikimedia.org/r/268856 (owner: 10Mobrovac) [16:18:59] bblack: ori: First blind pass at VCL change - https://gerrit.wikimedia.org/r/269149 [16:19:40] that should fix it [16:19:48] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: wgRCWatchCategoryMembership true everywhere except wikisource [[gerrit:264734]] (duration: 01m 26s) [16:19:50] ^ addshore check please [16:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:19:59] *checks* [16:20:31] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268467 (https://phabricator.wikimedia.org/T125353) (owner: 10Addshore) [16:21:14] looks good to me [16:21:34] addshore: thanks! [16:22:07] (03PS2) 10Dzahn: Apertium: Move logs to /srv/log [puppet] - 10https://gerrit.wikimedia.org/r/268852 (https://phabricator.wikimedia.org/T107900) (owner: 10Mobrovac) [16:22:52] Thanks mutante! [16:23:16] * thcipriani mentally wills zuul to pick up +2 [16:23:28] what change? 
[16:23:30] I think it did not fully work, so putting back all slaves back into the original topology [16:24:08] !log reverting slaves topology back to db1024 master [16:24:09] Dereckson: https://gerrit.wikimedia.org/r/#/c/268467/ [16:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:24:21] chasemp: is there any chance what we’re seeing is the ttl for .wmflabs expiring? Or just a bunch of *.wmflabs entries expiring at once? [16:24:50] If we assume that a query hits basically every labs instance every minute, then all the ttls will be in sync since the last restart. [16:25:18] (03CR) 10KartikMistry: [C: 031] Apertium: Move logs to /srv/log [puppet] - 10https://gerrit.wikimedia.org/r/268852 (https://phabricator.wikimedia.org/T107900) (owner: 10Mobrovac) [16:25:22] I don't know enough to say yes or no, but it's a sharp dropoff for that kind of thing [16:26:19] (03Merged) 10jenkins-bot: Add $wgWBRepoSettings['sparqlEndpoint'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268467 (https://phabricator.wikimedia.org/T125353) (owner: 10Addshore) [16:26:23] if the domain ttl expires, does that not expire subdomains? 
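The synchronized-TTL theory raised at 16:24:50 can be illustrated with a small sketch. It is not taken from the pdns configuration: the 3600-second TTL comes from the "60 minutes" figure mentioned in the channel, the once-a-minute query pattern is the assumption stated in that message, and the per-name offsets and the count of 500 names are hypothetical. The sketch only covers the entries cached right after a restart and does not model re-caching.

```python
TTL = 3600    # seconds; the "60 minutes" figure quoted in the channel
PERIOD = 60   # seconds; assumed: every name is queried once a minute


def first_expiries(n_names, ttl=TTL, period=PERIOD):
    """Expiry time (seconds since restart) of each name's first cache entry."""
    expiries = []
    for i in range(n_names):
        first_query = i % period            # hypothetical offset within the first minute
        expiries.append(first_query + ttl)  # entry is cached on its first query
    return expiries


def live_entries(expiries, t):
    """Number of those original entries still cached at time t."""
    return sum(1 for e in expiries if e > t)


expiries = first_expiries(500)
print(live_entries(expiries, TTL - 1))       # just before the first expiry
print(live_entries(expiries, TTL + PERIOD))  # one minute after it
```

Under these assumptions every entry created after a restart expires inside the same one-minute window, so a cache-entries graph would show a sharp drop roughly one TTL after each restart. In a real cache each name would be re-cached on its next query, so such a dip should be brief and repeat about once per TTL.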
[16:26:28] It is a load balancer "bug" [16:26:37] hm, ttl should only be 60 minutes though, so that doesn’t really add up [16:27:18] <_joe_> !log reinstalling pybal's new version (reduced) on ulsfo and codfw caches [16:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:27:22] !log restarted nutcracker in G@cluster:appserver and G@site:eqiad due to connect error issues (5 hosts per batch) [16:27:24] mediawiki checks the lag and the state of a slave master to check if it can be in read-write mode [16:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:27:46] but a slave's master doesn't mean it is the real master [16:28:05] this didn't put mediawiki in read-only mode: edits continued [16:28:12] but some api calls failed [16:28:12] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268208 (https://phabricator.wikimedia.org/T120197) (owner: 10Aude) [16:28:17] 6operations, 10Deployment-Systems, 5Patch-For-Review: l10nupdate user uid mismatch between tin and mira - https://phabricator.wikimedia.org/T119165#2008191 (10Joe) Just for the record, cron needs to be restarted if an uid has been changed, this made the l10nupdate job fail for days in a row [16:28:18] oh, also, chasemp, the cache drop-off is happening after the first outage alerts. So I don’t think it can be the cause. [16:28:26] it could be [16:28:31] from restart to fix? :) [16:28:38] !log thcipriani@mira Synchronized wmf-config/Wikibase-production.php: SWAT: Add $wgWBRepoSettings[sparqlEndpoint] [[gerrit:268467]] (duration: 01m 18s) [16:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:28:43] or "fix" if that's all misguided [16:28:46] ^ addshore check if possible [16:28:48] do we know how that happened with the failed ssh logins on labs phab ? 
[16:28:54] i also can't use my root login [16:28:56] another reason to move the load balancing outside of mediawiki- no real topology understanding [16:29:14] chasemp: oh, you’re right, it’s just you restarting [16:29:18] So long as wikidata is still up :) [16:29:45] addshore: looks like it :) [16:29:47] (03Merged) 10jenkins-bot: Don't request pageprops for mobile search/nearby on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268208 (https://phabricator.wikimedia.org/T120197) (owner: 10Aude) [16:29:55] chasemp: so I think we’re back to ‘turn up logging and wait for it to happen again' [16:30:07] still here :) [16:30:17] That's me all done then! [16:30:22] Many thanks! [16:32:09] Heh, and wifi has now just broken so only on my phone now.... [16:32:23] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Do not request pageprops for mobile search/nearby on wikidata [[gerrit:268208]] (duration: 01m 20s) [16:32:26] ^ aude check please [16:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:32:28] I sympathize addshore. [16:32:30] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [100000000.0] [16:32:39] (03CR) 10Dzahn: [C: 032] Apertium: Move logs to /srv/log [puppet] - 10https://gerrit.wikimedia.org/r/268852 (https://phabricator.wikimedia.org/T107900) (owner: 10Mobrovac) [16:32:56] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254645 (https://phabricator.wikimedia.org/T85368) (owner: 10Bene) [16:33:15] ok [16:33:43] addshore: you must be at office [16:33:53] (03Merged) 10jenkins-bot: Use custom generator for mobile search on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254645 (https://phabricator.wikimedia.org/T85368) (owner: 10Bene) [16:34:37] thcipriani: i think it's ok [16:34:45] aude: kk, thanks. 
[16:36:43] mutante: no, lots of thunder and lightning where I am today though [16:36:53] 6operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Check the redis (jobqueue) configuration in codfw - https://phabricator.wikimedia.org/T124672#2008216 (10Joe) a:3Joe [16:37:29] (03CR) 10Dzahn: "please make the logpath consistent with https://gerrit.wikimedia.org/r/#/c/268852 as filippo said in the inline comment" [puppet] - 10https://gerrit.wikimedia.org/r/268847 (https://phabricator.wikimedia.org/T107900) (owner: 10Mobrovac) [16:37:53] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Use custom generator for mobile search on Wikidata Part I [[gerrit:254645]] (duration: 01m 18s) [16:37:54] addshore: oh :) [16:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:39:17] !log thcipriani@mira Synchronized wmf-config/mobile.php: SWAT: Use custom generator for mobile search on Wikidata Part II [[gerrit:254645]] (duration: 01m 19s) [16:39:19] ^ aude check please [16:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:40:07] ok [16:40:54] looks good [16:41:11] 6operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Check the redis (jobqueue) configuration in codfw - https://phabricator.wikimedia.org/T124672#2008225 (10Joe) All in all, the core count in codfw is 48, vs 50 in eqiad. but the total memory is 60% of what we have in eqiad. To err on the side of caut... [16:41:16] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268573 (https://phabricator.wikimedia.org/T125801) (owner: 10Dereckson) [16:41:33] aude: thank you! 
[16:41:59] Reported as https://phabricator.wikimedia.org/T111266#2008233 [16:42:17] (03Merged) 10jenkins-bot: Namespaces configuration on mai.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268573 (https://phabricator.wikimedia.org/T125801) (owner: 10Dereckson) [16:44:50] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267997 (https://phabricator.wikimedia.org/T125509) (owner: 10Dereckson) [16:45:02] RECOVERY - RAID on db2012 is OK: OK: optimal, 1 logical, 2 physical [16:45:07] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Namespaces configuration on mai.wikipedia [[gerrit:268573]] (duration: 01m 17s) [16:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:45:10] ^ Dereckson check please [16:45:31] (03PS7) 10Nuria: Increase length of lag window to 100 [puppet] - 10https://gerrit.wikimedia.org/r/268594 (https://phabricator.wikimedia.org/T125916) [16:46:04] (03Merged) 10jenkins-bot: Enable signature button for the Project namespace in ru.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267997 (https://phabricator.wikimedia.org/T125509) (owner: 10Dereckson) [16:46:37] thcipriani: there are a lot of aliases, so I'd suggest to run namespacesDupe script preventively. [16:46:59] Tested. Aliases work. [16:48:35] Hi [16:48:51] Is there a reason why Ghostscript isn't used on Mediawiki to render EPS? [16:49:32] Further to the SVG issues the other day , I decided to see what the actual EPS was rendering as [16:49:41] and it was OK [16:49:53] Dereckson: mwscript namespaceDupes.php --wiki=maiwiki run [16:49:57] Hi ShakespeareFan00 [16:50:13] As it's a pain "cleaning" up EPS->PDF->SVG... 
Inaccuracies [16:50:17] 6operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Have a strategy to switch restbase to use services in the appropriate datacenter - https://phabricator.wikimedia.org/T126235#2008250 (10Joe) 3NEW [16:50:21] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [16:50:29] I'm in need of a long term solution [16:50:36] 6operations, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2008256 (10Dzahn) [16:50:37] ShakespeareFan00: you want #wikimedia-tech or to file a bug on Phabricator to discuss this issue [16:52:47] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable signature button for the Project namespace in ru.wiki [[gerrit:267997]] (duration: 01m 19s) [16:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:52:53] Testing. [16:53:06] thcipriani: works [16:53:11] Dereckson: thank you! [16:53:19] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267822 (https://phabricator.wikimedia.org/T121766) (owner: 10Dereckson) [16:53:43] Krinkle: why switch to commons in https://gerrit.wikimedia.org/r/#/c/269149 ? [16:54:13] ShakespeareFan00: #wikimedia-commons is to handle the daily running of servers and to deploy code, not to offer architecture suggestion. To use ghostscript for such rendering would indeed have advantages. Extension:PdfHandler uses it. [16:54:28] ShakespeareFan00: er #wikimedia-operations not #wikimedia-commons [16:54:45] Dereckson : OK [16:55:10] ShakespeareFan00: I'd suggest to file a task on Phabricator, and add PdfHandler author in cc [16:55:14] Moving this discussion [16:55:19] see you in -tech [16:55:25] k [16:56:19] (03CR) 10Andrew Bogott: [C: 031] "This sounds great, although I haven't been involved in ensuring that this is currently the case." 
[puppet] - 10https://gerrit.wikimedia.org/r/269155 (owner: 10Muehlenhoff) [16:56:37] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267820 (https://phabricator.wikimedia.org/T125315) (owner: 10Dereckson) [16:57:28] (03Merged) 10jenkins-bot: Deploy Translate extension on ru.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267822 (https://phabricator.wikimedia.org/T121766) (owner: 10Dereckson) [16:58:05] (03Merged) 10jenkins-bot: Set category collation on gd.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267820 (https://phabricator.wikimedia.org/T125315) (owner: 10Dereckson) [16:58:29] For the translate extension reedy already created the tables. [16:58:47] Dereckson: I saw that on the ticket, thank you :) [16:59:42] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Deploy Translate extension on ru.wikimedia [[gerrit:267822]] (duration: 01m 17s) [16:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:59:50] ^ Dereckson check please [17:00:39] Works. [17:01:19] 6operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Be able to switch programmatically between deployment servers in codfw and eqiad - https://phabricator.wikimedia.org/T124024#2008279 (10Joe) The test went well (sort of, we had an outage due to an operational error) and we're ready to switch back to t... 
[17:02:14] Dereckson: I'm going to bump https://gerrit.wikimedia.org/r/#/c/260541/ since I'm already over time :( [17:02:42] k [17:02:59] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Set category collation on gd.wikipedia [[gerrit:267820]] (duration: 01m 21s) [17:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:03:53] 6operations: make network saturation alert on labstore1003 sane - https://phabricator.wikimedia.org/T126237#2008292 (10chasemp) 3NEW [17:04:02] Dereckson: updateCollation done for gdwiki [17:04:07] k [17:04:23] and namespacesdupe is done too? [17:06:06] Dereckson: for maiwiki? yes. [17:06:47] Okay. Thank you for the deploy. [17:07:16] Dereckson: thank you! [17:08:22] 6operations, 10Deployment-Systems, 5Patch-For-Review: l10nupdate user uid mismatch between tin and mira - https://phabricator.wikimedia.org/T119165#2008312 (10Dzahn) >>! In T119165#2008191, @Joe wrote: > Just for the record, cron needs to be restarted if an uid has been changed, this made the l10nupdate job... [17:21:54] 6operations, 10Traffic: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2008345 (10BBlack) [17:21:55] (03CR) 10Ottomata: Increase length of lag window to 100 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/268594 (https://phabricator.wikimedia.org/T125916) (owner: 10Nuria) [17:23:36] jouncebot: refresh [17:23:38] I refreshed my knowledge about deployments. 
[17:23:44] jouncebot: next [17:23:45] In 2 hour(s) and 6 minute(s): Debug logging improvements (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160208T1930) [17:25:37] (03PS2) 10Mobrovac: Zotero: Move logs to /srv/log [puppet] - 10https://gerrit.wikimedia.org/r/268847 (https://phabricator.wikimedia.org/T107900) [17:29:42] (03CR) 10Mobrovac: Zotero: Move logs to /srv/log (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/268847 (https://phabricator.wikimedia.org/T107900) (owner: 10Mobrovac) [17:41:28] bblack: the entry point doesn't exist on www [17:42:05] bblack: we can switch the existing one to non-www though [17:42:15] (03PS1) 10EBernhardson: Dont reindex wgCirrusSearchNamespaceWeights from 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269168 [17:42:59] bd808: ^ might help with wikitech search, not sure. Turns out the namespace boosts were completely wrong because array_merge( $a, $b ) decided to reindex the array [17:43:05] (03CR) 10jenkins-bot: [V: 04-1] Dont reindex wgCirrusSearchNamespaceWeights from 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269168 (owner: 10EBernhardson) [17:43:40] (03PS2) 10EBernhardson: Dont reindex wgCirrusSearchNamespaceWeights from 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269168 [17:47:38] 6operations, 5Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Reduce the number of appservers we're using in eqiad preparing for decommission - https://phabricator.wikimedia.org/T126242#2008460 (10Joe) 3NEW a:3Joe [17:50:56] (03CR) 10Rush: [C: 031] "thanks man, let's keep an eye on perf and log size :)" [puppet] - 10https://gerrit.wikimedia.org/r/269152 (https://phabricator.wikimedia.org/T124680) (owner: 10Andrew Bogott) [17:52:29] (03PS2) 10Andrew Bogott: Turn pdns loglevels WAY UP [puppet] - 10https://gerrit.wikimedia.org/r/269152 (https://phabricator.wikimedia.org/T124680) [17:53:37] anyone thinking of deploying parsoid please let me know, I have a little testing I'd 
like to have you do [17:53:45] if you don't mind wasting an additional 5 minutes [17:54:21] apergos, ok. in 3 hours. [17:54:27] 6operations, 5Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Reduce the number of appservers we're using in eqiad - https://phabricator.wikimedia.org/T126242#2008517 (10Joe) [17:54:48] apergos, https://www.mediawiki.org/wiki/Parsoid/Deployments#Monday.2C_Feb_8_2016_around_1:15_pm_PT:_4d44fcc7_to_be_deployed .. so, starting with the deploy after today's one .. the stuck process problem should hopefully be resolved. [17:54:52] I live patched the module on mira to take the timeout setting so hoping that does what it needs [17:54:54] (03CR) 10Andrew Bogott: [C: 032] Turn pdns loglevels WAY UP [puppet] - 10https://gerrit.wikimedia.org/r/269152 (https://phabricator.wikimedia.org/T124680) (owner: 10Andrew Bogott) [17:54:58] oh *after* today's one [17:55:00] meeeehhh [17:55:14] because the test won't obviously do much if some restarts never happen [17:55:18] yes .. since the restart today will still be using the previous deploy's code. :) [17:55:23] because no timeout will be long enough for those :-D [17:55:33] right. [17:55:35] well hm [17:55:42] could you after you do the restart [17:55:53] and shoot the bad parsoids [17:56:00] or I or whoever is here shoots them [17:56:06] can I get you to try a restart again [17:56:13] via git deploy? [17:56:21] could I try your experiment after we finish the deploy? [17:56:21] you mean [17:56:25] yes [17:56:34] do your deploy and do your restart however you do it [17:56:41] sure .. after we are done with the deploy, i can issue another restart with the git deploy script. [17:56:46] yeah exactly [17:57:02] you need to give it a timeout arg but we'll talk about it then [17:57:29] sounds good. 
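Side note on the [17:42:59] remark that array_merge( $a, $b ) reindexed the namespace weights: PHP's array_merge() renumbers integer keys from 0, while the array union operator (+) keeps them. The sketch below is a toy Python model of those two behaviours, not the actual mediawiki-config change. The function names, namespace IDs and weights are made up for illustration, and the log does not say which fix the patch actually used.

```python
# Toy model of PHP array semantics; plain dicts stand in for PHP arrays.
# The namespace IDs and weights below are hypothetical.

def php_array_merge(*arrays):
    """Mimic PHP array_merge(): integer keys are renumbered from 0,
    string keys are kept and later values overwrite earlier ones."""
    result = {}
    next_index = 0
    for arr in arrays:
        for key, value in arr.items():
            if isinstance(key, int):
                result[next_index] = value
                next_index += 1
            else:
                result[key] = value
    return result


def php_array_union(left, right):
    """Mimic the PHP `+` operator: keys are preserved and the left
    operand wins when a key appears in both arrays."""
    result = dict(left)
    for key, value in right.items():
        result.setdefault(key, value)
    return result


defaults = {2: 0.05, 4: 0.1}   # hypothetical namespace ID -> weight
overrides = {12: 0.6}          # hypothetical per-wiki override

merged = php_array_merge(defaults, overrides)   # namespace IDs are lost
united = php_array_union(overrides, defaults)   # namespace IDs survive

print(merged)
print(united)
```

In the merged result the weights end up attached to keys 0, 1 and 2 instead of the original namespace IDs, which is the kind of silent mismatch described in the chat. The union keeps every ID, with the override taking precedence because it is the left operand.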
[17:57:46] 6operations, 7Monitoring, 5Patch-For-Review: switch diamond to use graphite line protocol - https://phabricator.wikimedia.org/T121861#2008533 (10chasemp) I would like to keep the ability to send timers [18:00:35] <_joe_> thcipriani: I have a conflicting meeting again, but next week I should be back on track [18:00:47] _joe_: kk, thanks. [18:01:01] <_joe_> thcipriani: sorry :) [18:01:03] 6operations, 10Analytics, 10ArchCom-RfC, 6Discovery, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#2008548 (10Milimetric) [18:01:11] 10Ops-Access-Reviews: Get Alex Monk access to wikitech-static - https://phabricator.wikimedia.org/T125715#2008549 (10Andrew) This was approved during the meeting today. I'll create a user with sudo for Alex to use. [18:01:15] <_joe_> thcipriani: actually, do you need me for discussing something specific? [18:01:20] (03PS3) 10Dzahn: admin: add arlolra,cscott,gwicke to parsoid-test-admins [puppet] - 10https://gerrit.wikimedia.org/r/268808 (https://phabricator.wikimedia.org/T124701) [18:01:34] (03CR) 10Dzahn: [C: 032] "approved in meeting" [puppet] - 10https://gerrit.wikimedia.org/r/268808 (https://phabricator.wikimedia.org/T124701) (owner: 10Dzahn) [18:01:56] _joe_: I don't think so, not this meeting, I think from the dev-side we fell a little behind last week. [18:02:01] andrewbogott, regarding DNS, are you sure I am looking at the right place, m5-master? 
[18:02:06] (03PS1) 10ArielGlenn: media directory list generation: allow specification of wikis to skip [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/269173 [18:02:08] <_joe_> ok [18:02:17] jynus: pretty sure, but let me double-check [18:02:31] because m5-master is a SPOF, with no plan B in case of hardware failure [18:02:39] you should not rely on that only [18:03:10] I mean, there are backups, but the time to recovery would be long (hours) [18:03:11] jynus: yes, gmysql-host=m5-master.eqiad.wmnet [18:03:13] (03CR) 10ArielGlenn: [C: 032] media directory list generation: allow specification of wikis to skip [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/269173 (owner: 10ArielGlenn) [18:03:32] If dns depends on that, we should but it behind a proxy [18:03:36] jynus: that’s what all labs/openstack services are using [18:03:36] *put [18:03:48] (03CR) 10Rush: [C: 04-1] "talked to filippo and I would like to hold off on this for a bit as:" [puppet] - 10https://gerrit.wikimedia.org/r/268360 (https://phabricator.wikimedia.org/T121861) (owner: 10Filippo Giunchedi) [18:04:12] not happy with the current setup, you should ask for a full HA setup [18:04:24] (which I can provide) [18:04:32] jynus: ok, I’ll make a ticket, thank you [18:05:06] (03PS3) 10EBernhardson: Dont reindex wgCirrusSearchNamespaceWeights from 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269168 [18:07:09] 10Ops-Access-Requests, 6operations, 6Parsing-Team, 5Patch-For-Review: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#2008565 (10Dzahn) 5Open>3Resolved @arlolra @cscott @gwicke you have new permissions to contr... 
[18:08:16] (03PS2) 10Dzahn: admin: add mobrovac to mathoid and cxserver admins [puppet] - 10https://gerrit.wikimedia.org/r/268839 (https://phabricator.wikimedia.org/T125879) [18:10:40] (03CR) 10Dzahn: [C: 032] admin: add mobrovac to mathoid and cxserver admins [puppet] - 10https://gerrit.wikimedia.org/r/268839 (https://phabricator.wikimedia.org/T125879) (owner: 10Dzahn) [18:10:52] (03CR) 10Dzahn: "approved in meeting" [puppet] - 10https://gerrit.wikimedia.org/r/268839 (https://phabricator.wikimedia.org/T125879) (owner: 10Dzahn) [18:11:01] (03CR) 10Aaron Schulz: [C: 031] Rationalize definition of service hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266509 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [18:13:13] 6operations, 10Traffic, 6Zero, 3Mobile-Content-Service, and 3 others: Enable X-CS headers for non-mobile domains - https://phabricator.wikimedia.org/T126053#2008599 (10jhobs) [18:13:22] !log cleanup snapshots on labstore1001 [18:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:13:32] (03PS2) 10Jcrespo: Enable ferm for dbstore databases on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/269135 [18:14:21] godog: I have 459247b51fcc588d9c5a5c11746414ac298cc14d for the 3.0 tag [18:14:29] (03CR) 10EBernhardson: [C: 032] Dont reindex wgCirrusSearchNamespaceWeights from 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269168 (owner: 10EBernhardson) [18:14:58] thcipriani: ack, same here, thanks! 
[18:15:23] (03Merged) 10jenkins-bot: Dont reindex wgCirrusSearchNamespaceWeights from 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269168 (owner: 10EBernhardson) [18:15:28] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: add mobrovac to mathoid and cxserver admins (was: Allow mobrovac to start/stop/restart services on SCx) - https://phabricator.wikimedia.org/T125879#2008614 (10Dzahn) @mobrovac you are now also an admin of cxserver, puppet just added you on sca1001. abou... [18:15:40] (03CR) 10Jcrespo: [C: 032] Enable ferm for dbstore databases on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/269135 (owner: 10Jcrespo) [18:17:38] !log ebernhardson@mira Synchronized wmf-config/CirrusSearch-common.php: dont reindex wgCirrusSearchNamespaceWeights from 0 (duration: 01m 17s) [18:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:18:13] twentyafterfour: what's a file I could check to see if it is updated? I tried recreating the tarball from 3.0 tag but it ends up being the same of what I've used [18:19:33] godog: I checked scap/deploy.py and noted that DeployLocal extends DeployApplication but in my repo DeployApplication is gone and DeployLocal extends cli.Application [18:20:42] godog: I see what happened - it wasn't your package, it was my vm instance - apt-get update has been broken in labs [18:21:09] so it was installing an old version of the package. I'm not sure why the version number was 3.0-1 [18:21:17] 10Ops-Access-Requests, 6operations, 10DBA, 5Patch-For-Review: Grant mysql client access to testreduce_vd and testreduce_0715 databases - https://phabricator.wikimedia.org/T125435#2008649 (10jcrespo) Now that T124701 has been clarified, I will apply this. 
[18:21:19] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: add mobrovac to mathoid and cxserver admins (was: Allow mobrovac to start/stop/restart services on SCx) - https://phabricator.wikimedia.org/T125879#2008651 (10Dzahn) 5Open>3Resolved nevermind, it's ok, you are mathoid admin on scb100x as it should be... [18:21:48] godog: `apt-get upgrade scap` seems to have fixed it [18:22:51] 10Ops-Access-Requests, 6operations: add mobrovac to mathoid and cxserver admins (was: Allow mobrovac to start/stop/restart services on SCx) - https://phabricator.wikimedia.org/T125879#2008659 (10Dzahn) [18:23:21] 10Ops-Access-Requests, 6operations, 6Parsing-Team: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#2008660 (10Dzahn) [18:24:12] twentyafterfour: oh ok! yeah that explains it [18:24:14] (03PS6) 10Dzahn: Adding user gehel (Guillaume Lederrey) to user list and to necessary groups [puppet] - 10https://gerrit.wikimedia.org/r/267919 (https://phabricator.wikimedia.org/T125651) (owner: 10Gehel) [18:24:20] (03PS1) 10ArielGlenn: listwikiuploaddirs pylint and pep8 fixes [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/269177 [18:24:49] 6operations, 5Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Reduce the number of appservers we're using in eqiad - https://phabricator.wikimedia.org/T126242#2008678 (10Johsthao) [18:24:55] 6operations, 10ops-codfw, 10procurement: codfw- db2012and db2019 5x600GB SAS drives purchase request - https://phabricator.wikimedia.org/T126226#2008683 (10Johsthao) [18:25:01] godog: sorry about that, didn't mean to waste your time like that [18:25:06] !log re-enabled puppet on mc1004 [18:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:25:23] twentyafterfour: nah that's fine! 
no worries at all [18:25:29] 6operations, 10Traffic: Upgrade to Varnish 4: things to remember - https://phabricator.wikimedia.org/T126206#2008698 (10Johsthao) [18:25:38] 6operations, 10ops-codfw: es2004 doesn't come back up after reboot - https://phabricator.wikimedia.org/T126203#2008700 (10Johsthao) [18:25:50] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [18:26:08] Hey ops team, we have an issue with a guy spamming our phabricator tasks [18:26:20] How can we either ban, or something else ? [18:26:43] yei, spam! [18:27:05] jynus: who? [18:27:08] oops [18:27:10] wrong person [18:27:14] joal: who / where? [18:27:18] :-) [18:27:18] Johsthao [18:27:40] Seems mostly analytics oriented (not too bad for others) [18:28:01] chasemp: example: https://phabricator.wikimedia.org/T126250 [18:28:42] joal: user is disabled, I'll try to note on their talk page [18:28:59] Thanks chasemp [18:29:05] \o/ ! [18:29:08] some maps tickets got closed [18:29:14] (03CR) 10ArielGlenn: [C: 032] listwikiuploaddirs pylint and pep8 fixes [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/269177 (owner: 10ArielGlenn) [18:29:19] (03PS8) 10Aaron Schulz: Define Production service entries for InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [18:29:23] chasemp: any idea how to revert automagically, or shall we go by hand ? 
[18:29:34] !log performing rolling restart of Cassandra in staging (to pickup /usr/share/cassandra/lib/cassandra-brotli-1.0.0-a64ce47.jar in classpath) [18:29:37] (03CR) 10jenkins-bot: [V: 04-1] Define Production service entries for InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [18:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:29:41] joal: there is no magic undo user actions atm [18:29:53] joal: you can do batch actions to reopen, but this presumably still wrecked assignees [18:29:55] oki, we are gonna do some cleaning then :) [18:29:59] although, maybe not [18:30:06] let me see if i can undo this quickly [18:30:30] !log applying ferm on dbstore1001 and dbstore1002 [18:30:31] MatmaRex: If you do that, a-team owes a at list a beer :) [18:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:31:15] (03CR) 10Aaron Schulz: Define Production service entries for InitialiseSettings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [18:31:27] (03CR) 10Dzahn: [C: 032] "https://www.mediawiki.org/wiki/User:GLederrey_%28WMF%29" [puppet] - 10https://gerrit.wikimedia.org/r/267919 (https://phabricator.wikimedia.org/T125651) (owner: 10Gehel) [18:32:35] 6operations, 5Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Switchover of the application servers to codfw - https://phabricator.wikimedia.org/T124671#2008789 (10matmarex) [18:32:37] 6operations, 5Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Reduce the number of appservers we're using in eqiad - https://phabricator.wikimedia.org/T126242#2008788 (10matmarex) 5duplicate>3Open [18:32:43] 6operations, 10ops-codfw: db2019 has a failed disk - https://phabricator.wikimedia.org/T120073#2008798 (10matmarex) [18:32:51] 
6operations, 10ops-codfw, 10procurement: codfw- db2012and db2019 5x600GB SAS drives purchase request - https://phabricator.wikimedia.org/T126226#2008793 (10matmarex) 5duplicate>3Open [18:33:28] 6operations, 10Traffic: Upgrade to Varnish 4: things to remember - https://phabricator.wikimedia.org/T126206#2008815 (10matmarex) 5duplicate>3Open [18:33:33] 6operations, 10ops-codfw: es2004 doesn't come back up after reboot - https://phabricator.wikimedia.org/T126203#2008820 (10matmarex) 5duplicate>3Open [18:34:05] joal: okay, i guess it worked [18:34:11] MatmaRex: yay! [18:34:11] Yay ! [18:34:31] i pasted the list of task ids into the search at https://phabricator.wikimedia.org/maniphest/query/advanced/ [18:34:40] and reopened them all with a batch action [18:34:51] i hope i haven't made more of a mess than it was before [18:34:55] You rock MatmaRex ! [18:35:01] We'll finish up the cleaning :) [18:35:04] :) thx MatmaRex [18:35:29] !log Reopened 54 Phabricator tasks that someone merged into one, hope I haven't made more of a mess than it was before [18:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:39:47] (03PS1) 10ArielGlenn: dumps: skip labswiki, labtestwiki for generating media dir lists [puppet] - 10https://gerrit.wikimedia.org/r/269179 [18:39:51] 6operations, 10DBA: Populate the wikishared db on all dbstores - https://phabricator.wikimedia.org/T126252#2008884 (10jcrespo) 3NEW a:3jcrespo [18:39:55] (03CR) 10Filippo Giunchedi: [C: 04-1] "logrotate glob isn't correct, good to go otherwise" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/268847 (https://phabricator.wikimedia.org/T107900) (owner: 10Mobrovac) [18:41:43] (03PS2) 10ArielGlenn: dumps: skip labswiki, labtestwiki for generating media dir lists [puppet] - 10https://gerrit.wikimedia.org/r/269179 [18:42:00] MatmaRex: ty [18:42:04] 10Ops-Access-Requests, 6operations, 3Discovery-Search-Sprint, 5Patch-For-Review: Access for new Discovery OpsEng: Guillaume 
Lederrey - https://phabricator.wikimedia.org/T125651#2008916 (10Dzahn) @gehel on bast1001, your user has just been created Notice: Finished catalog run in 33.79 seconds [bast1001:~]... [18:42:49] 6operations, 5Patch-For-Review: reinstall eqiad memcache servers with jessie - https://phabricator.wikimedia.org/T123711#2008944 (10elukey) Update about today: I followed https://wikitech.wikimedia.org/wiki/User:Elukey/Ops/JessieMigration with mc1004 starting with: sudo service redis-server stop (this one is... [18:43:02] 6operations, 10RESTBase-Cassandra: impact of large sstables on cassandra - https://phabricator.wikimedia.org/T126221#2008949 (10fgiunchedi) ah that might have something to do with {T94121} too [18:43:32] !log rolling Cassandra restart in restbase staging complete [18:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:44:39] (03CR) 10ArielGlenn: [C: 032] dumps: skip labswiki, labtestwiki for generating media dir lists [puppet] - 10https://gerrit.wikimedia.org/r/269179 (owner: 10ArielGlenn) [18:45:47] Krinkle: set up 30 min for today cc milimetric mforns , shouldn't be that long [18:46:22] please ignore whines from snapshot1003 about puppet, thanks [18:47:06] 6operations, 10hardware-requests: additional graphite machines request, 1x per DC - https://phabricator.wikimedia.org/T126253#2008955 (10fgiunchedi) 3NEW a:3fgiunchedi [18:47:31] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active [18:47:31] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /page/title/{title} (Get rev by title from storage) is CR [18:49:20] 
mobrovac: ^ is the new mathoid problem after a restart? [18:49:39] 10Ops-Access-Requests, 6operations, 3Discovery-Search-Sprint, 5Patch-For-Review: Access for new Discovery OpsEng: Guillaume Lederrey - https://phabricator.wikimedia.org/T125651#2008967 (10Dzahn) 5Open>3Resolved a:3Dzahn [18:50:04] it seems mutante, will investigate [18:50:20] cool [18:50:50] PROBLEM - puppet last run on snapshot1003 is CRITICAL: CRITICAL: puppet fail [18:51:10] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [18:52:19] 10Ops-Access-Requests, 6operations, 3Discovery-Search-Sprint, 5Patch-For-Review: Access for new Discovery OpsEng: Guillaume Lederrey - https://phabricator.wikimedia.org/T125651#2008999 (10Gehel) checked access to elastic1001: it's working Thanks! [18:52:44] 10Ops-Access-Reviews: Get Alex Monk access to wikitech-static - https://phabricator.wikimedia.org/T125715#2009001 (10Krenair) 5Open>3Resolved [18:52:52] 10Ops-Access-Requests, 6operations, 3Discovery-Search-Sprint: Access for new Discovery OpsEng: Guillaume Lederrey - https://phabricator.wikimedia.org/T125651#2009004 (10Dzahn) [18:54:23] mutante: ^^^ rb is good again, must have asked during the restart [18:54:52] !log mathoid deployed 4bdb2f18c [18:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:55:03] (03CR) 10Dzahn: "we can use parsoid-test-roots as the group, or both, parsoid-test-roots and parsoid-test-admins additonally" [puppet] - 10https://gerrit.wikimedia.org/r/268438 (https://phabricator.wikimedia.org/T125435) (owner: 10Jcrespo) [18:55:11] mobrovac: :) thx [18:57:18] cscott: still into this old change back? 
https://gerrit.wikimedia.org/r/#/c/170130/ [18:57:41] uploaded in 2014 but yuvipanda said there's already a different role [18:59:24] (03PS1) 10ArielGlenn: dumps: move wikiquery dblist setup into separate class [puppet] - 10https://gerrit.wikimedia.org/r/269180 [18:59:32] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: /page/revision/{revision} (Get rev by ID) is CRITICAL: Test Get rev by ID returned the unexpected status 500 (expecting: 200) [18:59:53] (03PS3) 10Jcrespo: Add access to m5-master:testreduce* dbs for ssastry on ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/268438 (https://phabricator.wikimedia.org/T125435) [19:00:37] (03CR) 10Cscott: "I'm planning on redoing the patch to use yuvi's new roles at some point. I suppose I should open a phab task for that, and then reference" [puppet] - 10https://gerrit.wikimedia.org/r/170130 (owner: 10Cscott) [19:01:41] (03CR) 10ArielGlenn: [C: 032] dumps: move wikiquery dblist setup into separate class [puppet] - 10https://gerrit.wikimedia.org/r/269180 (owner: 10ArielGlenn) [19:01:59] (03CR) 10Jcrespo: "The only account created here is ssastry- please ask for personal accounts for everybody on subsequent tickets. Alter tables is included." 
[puppet] - 10https://gerrit.wikimedia.org/r/268438 (https://phabricator.wikimedia.org/T125435) (owner: 10Jcrespo) [19:03:31] (03PS1) 10ArielGlenn: dumps: fix typo in class name [puppet] - 10https://gerrit.wikimedia.org/r/269181 [19:03:40] !log restart restbase on restbase-test2001.codfw (staging) [19:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:03:54] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.16.149, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [19:04:04] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [19:04:31] !log sarin - signing puppet certs, salt-key, initial run [19:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:04:55] (03CR) 10ArielGlenn: [C: 032] dumps: fix typo in class name [puppet] - 10https://gerrit.wikimedia.org/r/269181 (owner: 10ArielGlenn) [19:05:44] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [19:06:54] RECOVERY - puppet last run on snapshot1003 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [19:12:49] 6operations, 10Traffic: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2009050 (10ori) Since {e10801bfe8} rolled out on Friday, we can use the presence or absence of the Backend-Timing header to tell apart what percentage of traffic is served from cache objects that are older... 
[19:15:03] PROBLEM - cassandra-a CQL 10.64.0.202:9042 on xenon is CRITICAL: Connection refused [19:15:05] PROBLEM - cassandra-a service on xenon is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [19:15:37] (03PS4) 10Jcrespo: Add access to m5-master:testreduce* dbs for ssastry on ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/268438 (https://phabricator.wikimedia.org/T125435) [19:15:47] 6operations, 10ops-codfw, 10procurement: codfw- db2012and db2019 5x600GB SAS drives purchase request - https://phabricator.wikimedia.org/T126226#2009059 (10RobH) [19:16:51] (03CR) 10Jcrespo: "@Dzahn, as I am modifying modules/admin/data/data.yaml now to comment on this grants, I will ask for a reevaluation of the patch." [puppet] - 10https://gerrit.wikimedia.org/r/268438 (https://phabricator.wikimedia.org/T125435) (owner: 10Jcrespo) [19:16:54] RECOVERY - cassandra-a service on xenon is OK: OK - cassandra-a is active [19:18:14] nuria: OK [19:18:34] RECOVERY - cassandra-a CQL 10.64.0.202:9042 on xenon is OK: TCP OK - 0.004 second response time on port 9042 [19:24:09] (03CR) 10Jcrespo: [C: 031] "Puppet is happy with it, although the admin diff is not very good: https://puppet-compiler.wmflabs.org/1695/ruthenium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/268438 (https://phabricator.wikimedia.org/T125435) (owner: 10Jcrespo) [19:24:40] 6operations, 10Traffic: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2009122 (10BBlack) My guess would be you're already seeing 75%-ish or more. [19:26:21] (03PS1) 10Madhuvishy: eventlogging: Lower parallelization of mysql consumers temporarily [puppet] - 10https://gerrit.wikimedia.org/r/269185 (https://phabricator.wikimedia.org/T125225) [19:27:19] 6operations, 10ops-codfw: es2004 doesn't come back up after reboot - https://phabricator.wikimedia.org/T126203#2009139 (10Papaul) p:5Triage>3Normal [19:27:40] (03CR) 10Ottomata: "Whoa, no idea. 
I found the initial commit where it was created, a long time ago when this was a git submodule:" [puppet] - 10https://gerrit.wikimedia.org/r/266975 (owner: 10Dzahn) [19:28:43] (03CR) 10Ottomata: "Ah, the documentation is in cassandra.yaml.erb, and is from the upstream package." [puppet] - 10https://gerrit.wikimedia.org/r/266975 (owner: 10Dzahn) [19:28:49] (03CR) 10Catrope: [C: 032] Have Beta job queue settings shadow production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266949 (owner: 10Mattflaschen) [19:29:57] (03Merged) 10jenkins-bot: Have Beta job queue settings shadow production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266949 (owner: 10Mattflaschen) [19:30:04] bd808: Respected human, time to deploy Debug logging improvements (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160208T1930). Please do the needful. [19:30:29] Will be starting this soon ^ [19:32:15] (03CR) 10Mobrovac: Zotero: Move logs to /srv/log (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/268847 (https://phabricator.wikimedia.org/T107900) (owner: 10Mobrovac) [19:33:27] 6operations, 10Salt, 5Patch-For-Review: setup/deploy sarin(WMF5851) as a salt master in codfw - https://phabricator.wikimedia.org/T125752#2009185 (10Papaul) [19:33:47] is there a user-friendly-ish way to disable all logins on a given wiki? [19:33:51] bd808: word [19:35:10] (03CR) 10Ottomata: [C: 032] eventlogging: Lower parallelization of mysql consumers temporarily [puppet] - 10https://gerrit.wikimedia.org/r/269185 (https://phabricator.wikimedia.org/T125225) (owner: 10Madhuvishy) [19:35:22] (03CR) 10Dzahn: [C: 031] Add access to m5-master:testreduce* dbs for ssastry on ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/268438 (https://phabricator.wikimedia.org/T125435) (owner: 10Jcrespo) [19:35:56] andrewbogott: You want to shut down Special:UserLogin + the auth api or ?? 
[19:36:10] !log mattflaschen@mira Synchronized wmf-config/InitialiseSettings-labs.php: Beta Cluster-only change (duration: 01m 20s) [19:36:10] bd808: mostly the former [19:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:36:22] I would’ve though that unset( $wgSpecialPages['UserLogin'] ); at the end of CommonSettings would do it [19:36:25] but it seems not to [19:36:30] (03PS5) 10Jcrespo: Add access to m5-master:testreduce* dbs for ssastry on ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/268438 (https://phabricator.wikimedia.org/T125435) [19:36:55] (03CR) 10Filippo Giunchedi: [C: 031] Zotero: Move logs to /srv/log (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/268847 (https://phabricator.wikimedia.org/T107900) (owner: 10Mobrovac) [19:38:05] andrewbogott: did you hook SpecialPage_initList to do that? [19:38:28] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#2009214 (10Tgr) Per IRC discussion, we stopped the script and will set a secret random $wgAuthenticationTokenVersion value. [19:38:45] bd808: no, is that what I should do? I was hoping to just clobber it at the very end of init [19:38:51] (03CR) 10Jcrespo: [C: 032] Add access to m5-master:testreduce* dbs for ssastry on ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/268438 (https://phabricator.wikimedia.org/T125435) (owner: 10Jcrespo) [19:39:15] I think you have to use the hook. See http://www.xpertdeveloper.com/2012/12/disable-login-logout-in-mediawiki/ or the code in SpecialPageFactory [19:40:10] bd808: oh, I looked at that page already but thought the solution seemed too silly to be reasonable [19:40:59] and of course, it fails [19:42:15] akosiaris: round? 
[19:42:22] 6operations, 10Traffic, 7Performance: Estimate effective cache time for text - https://phabricator.wikimedia.org/T126063#2009228 (10ori) [19:42:24] 6operations, 10Traffic: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2009229 (10ori) [19:43:26] (03PS1) 10ArielGlenn: admin: add datacenter-ops to neodymium for salt-key use [puppet] - 10https://gerrit.wikimedia.org/r/269192 [19:44:19] (03PS8) 10Nuria: Increase length of lag window to 100 [puppet] - 10https://gerrit.wikimedia.org/r/268594 (https://phabricator.wikimedia.org/T125916) [19:44:42] (03CR) 10ArielGlenn: "maybe we want this in the role instead of having it in these hostname-based files." [puppet] - 10https://gerrit.wikimedia.org/r/269192 (owner: 10ArielGlenn) [19:45:26] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Puppet has 1 failures [19:45:54] that is me, which doesn't realize the difference between a template and a file [19:47:34] (03PS1) 10Jcrespo: Fixing small bug in template instancing for ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/269198 (https://phabricator.wikimedia.org/T125435) [19:47:54] (03PS2) 10Jcrespo: Fixing small bug in template instancing for ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/269198 (https://phabricator.wikimedia.org/T125435) [19:48:41] * bd808 watches the job at the front of zuul [19:49:13] it's soooo slow. Looks like maybe bad node selection by Jenkins [19:49:15] (03CR) 10jenkins-bot: [V: 04-1] Fixing small bug in template instancing for ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/269198 (https://phabricator.wikimedia.org/T125435) (owner: 10Jcrespo) [19:49:20] merges all IRC channels into a single one [19:50:15] how can you not merge that, come on! [19:50:26] it is a one line change! [19:50:35] nobody has changed that! [19:51:02] papaul: I accepted the salt key for sarin [19:51:13] jynus: it actually said both, -1 AND +2 :o [19:51:19] now that is fun [19:51:32] wat? 
[19:51:56] yes, it complained about merging, not the actual change [19:52:05] (03CR) 10Ori.livneh: [C: 031] SPDY support toggle, off for cp1008 canary [puppet] - 10https://gerrit.wikimedia.org/r/268892 (https://phabricator.wikimedia.org/T125979) (owner: 10BBlack) [19:52:14] ah, maybe it interacted with itself? [19:52:15] (03CR) 10Ori.livneh: [C: 031] disable SPDY for all cache_text [puppet] - 10https://gerrit.wikimedia.org/r/268893 (https://phabricator.wikimedia.org/T125979) (owner: 10BBlack) [19:52:18] yea, and grrrit-wm decided to not talk about the +2 [19:52:25] 6operations, 6Performance-Team, 10Traffic, 5Patch-For-Review: Disable SPDY on cache_text for a week - https://phabricator.wikimedia.org/T125979#2009316 (10ori) p:5Triage>3Normal [19:52:47] now, my manual merge fails because it says no change :-) [19:52:51] lol [19:52:57] (03CR) 10Dzahn: [C: 031] Fixing small bug in template instancing for ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/269198 (https://phabricator.wikimedia.org/T125435) (owner: 10Jcrespo) [19:53:09] oh, I do not need +1 for that [19:53:12] (03CR) 10Dzahn: [C: 032] Fixing small bug in template instancing for ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/269198 (https://phabricator.wikimedia.org/T125435) (owner: 10Jcrespo) [19:53:41] there it goes [19:54:04] let's see what ruth has to say about that [19:54:13] "ruth" i like that [19:54:30] it's a woman's name here, not sure if elsewhere [19:54:33] apergos: thanks [19:54:44] you'll get access to neodymium back shortly [19:54:48] sorry about that! 
[19:55:00] s/back// [19:55:54] i'm on it, moving that to role [19:55:59] * bd808 is still waiting for jenkins to merge https://gerrit.wikimedia.org/r/#/c/269083/ :(( [19:56:06] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:57:14] apergos: no problem [20:00:34] 6operations, 10Salt, 5Patch-For-Review: setup/deploy sarin(WMF5851) as a salt master in codfw - https://phabricator.wikimedia.org/T125752#2009346 (10Papaul) [20:01:43] (03CR) 10Mobrovac: "+1 for removing it from the class entirely." [puppet] - 10https://gerrit.wikimedia.org/r/266975 (owner: 10Dzahn) [20:02:03] 6operations, 10Salt, 5Patch-For-Review: setup/deploy sarin(WMF5851) as a salt master in codfw - https://phabricator.wikimedia.org/T125752#2009351 (10Papaul) a:5Papaul>3ArielGlenn I am done with the installation of this system. I am assigner the task to Ariel for service implementation. [20:04:27] 6operations, 10Salt, 5Patch-For-Review: setup/deploy sarin(WMF5851) as a salt master in codfw - https://phabricator.wikimedia.org/T125752#2009366 (10ArielGlenn) yay! [20:04:36] (03PS2) 10Dzahn: salt:move datacenter-ops admin group to hiera role [puppet] - 10https://gerrit.wikimedia.org/r/269192 (owner: 10ArielGlenn) [20:04:52] apergos: ^ like so ? [20:05:13] 10Ops-Access-Requests, 6operations, 10DBA, 5Patch-For-Review: Grant mysql client access to testreduce_vd and testreduce_0715 databases - https://phabricator.wikimedia.org/T125435#2009369 (10jcrespo) 5Open>3Resolved ``` ssastry@ruthenium:~$ mysql -u ssastry Welcome to the MySQL monitor. Commands end wi... [20:06:12] mutante: forgive my ignorance but that won't apply to labs right? 
[20:06:19] in which case awesoem [20:06:52] apergos: no, labs has a separate hiera tree [20:07:00] perfect [20:07:20] oh, wait [20:07:32] i'm removing it from palladium , but should not [20:07:37] unless i also have a role for puppet masters [20:07:57] palladium just happened to be 2 things [20:09:06] (03CR) 10ArielGlenn: [C: 031] "yeah that's much better." [puppet] - 10https://gerrit.wikimedia.org/r/269192 (owner: 10ArielGlenn) [20:11:36] (03PS3) 10Dzahn: salt:move datacenter-ops admin group to hiera role [puppet] - 10https://gerrit.wikimedia.org/r/269192 (owner: 10ArielGlenn) [20:18:46] PROBLEM - cassandra-a CQL 10.64.16.153:9042 on cerium is CRITICAL: Connection refused [20:19:07] PROBLEM - cassandra-a service on cerium is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [20:20:41] (03PS4) 10ArielGlenn: salt:move datacenter-ops admin group to hiera role [puppet] - 10https://gerrit.wikimedia.org/r/269192 [20:22:34] (03CR) 10Ottomata: [C: 032 V: 032] Update AQS config with new syntax [puppet] - 10https://gerrit.wikimedia.org/r/268560 (https://phabricator.wikimedia.org/T122249) (owner: 10Milimetric) [20:23:33] (03PS1) 10Ori.livneh: Add interwiki.php; use it on mw1017 & on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269201 (https://phabricator.wikimedia.org/T122362) [20:24:09] Krinkle: ^ [20:25:47] RECOVERY - cassandra-a CQL 10.64.16.153:9042 on cerium is OK: TCP OK - 0.004 second response time on port 9042 [20:25:56] handled ^^ [20:26:07] RECOVERY - cassandra-a service on cerium is OK: OK - cassandra-a is active [20:26:19] 6operations: Install php5-readline on jessie hosts so eval.php, sql.php, and so on are more useful - https://phabricator.wikimedia.org/T126262#2009454 (10Anomie) 3NEW [20:26:33] !log restbase deploy start of c929ceb [20:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:26:39] (03Abandoned) 10BBlack: set X-CS and similar for all requests [puppet] - 
10https://gerrit.wikimedia.org/r/268729 (https://phabricator.wikimedia.org/T126053) (owner: 10BBlack) [20:27:18] 6operations, 10Traffic, 6Zero, 3Mobile-Content-Service, and 3 others: Send X-Carrier + X-Carrier-Meta headers on all responses - https://phabricator.wikimedia.org/T126053#2009467 (10BBlack) [20:27:54] (03PS5) 10Dzahn: salt:move datacenter-ops admin group to hiera role [puppet] - 10https://gerrit.wikimedia.org/r/269192 (owner: 10ArielGlenn) [20:28:33] 6operations, 10Traffic, 6Zero, 3Mobile-Content-Service, and 3 others: Send X-Carrier + X-Carrier-Meta headers on all responses - https://phabricator.wikimedia.org/T126053#2003432 (10BBlack) Updated task title, in a recent meeting we decided to avoid the legacy X-CS[2] header names (which are full of legacy... [20:29:51] (03CR) 10Alex Monk: "so it sounds like updateinterwikicache needs to be updated to also generate one of these?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269201 (https://phabricator.wikimedia.org/T122362) (owner: 10Ori.livneh) [20:30:12] (03PS1) 10BBlack: Emit X-Carrier[-Meta] headers to all clients for all requests [puppet] - 10https://gerrit.wikimedia.org/r/269205 (https://phabricator.wikimedia.org/T126053) [20:32:25] (03CR) 10Dzahn: [C: 032] salt:move datacenter-ops admin group to hiera role [puppet] - 10https://gerrit.wikimedia.org/r/269192 (owner: 10ArielGlenn) [20:34:07] papaul, thanks to mu tante you shuold have neodymium access now [20:34:12] *should [20:34:15] !log restbase deploy end of c929ceb [20:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:34:19] (03PS2) 10BBlack: Emit X-Carrier[-Meta] headers to all clients for all requests [puppet] - 10https://gerrit.wikimedia.org/r/269205 (https://phabricator.wikimedia.org/T126053) [20:34:35] of course he's gone. 
ah well [20:35:49] (03CR) 10BBlack: [C: 032] Emit X-Carrier[-Meta] headers to all clients for all requests [puppet] - 10https://gerrit.wikimedia.org/r/269205 (https://phabricator.wikimedia.org/T126053) (owner: 10BBlack) [20:38:46] (03PS1) 10BBlack: X-C/X-C-M: only set from frontends [puppet] - 10https://gerrit.wikimedia.org/r/269208 [20:39:01] (03PS2) 10BBlack: X-C/X-C-M: only set from frontends [puppet] - 10https://gerrit.wikimedia.org/r/269208 [20:39:07] (03CR) 10BBlack: [C: 032 V: 032] X-C/X-C-M: only set from frontends [puppet] - 10https://gerrit.wikimedia.org/r/269208 (owner: 10BBlack) [20:41:11] bblack, i would change it a bit -- in https://gerrit.wikimedia.org/r/#/c/269205/2/modules/varnish/templates/vcl/wikimedia.vcl.erb - line 589, make it nested if [20:41:55] yurik: it's not an expensive check anyways, and the nesting is already done when they're set initially [20:41:58] but we can! [20:42:06] bblack, call me paranoid :) [20:43:03] (03PS1) 10ArielGlenn: sarin as codfw redundant salt master [puppet] - 10https://gerrit.wikimedia.org/r/269209 [20:43:15] (03PS1) 10BBlack: X-C/X-C-M: nest check for minor efficiency gain [puppet] - 10https://gerrit.wikimedia.org/r/269210 [20:43:23] bblack, its not about being expensive, its about not ever producing meta without carrier, by definition :) [20:43:44] (03CR) 10BBlack: [C: 032 V: 032] X-C/X-C-M: nest check for minor efficiency gain [puppet] - 10https://gerrit.wikimedia.org/r/269210 (owner: 10BBlack) [20:43:52] yurik: we already don't on the inbound side, but sure [20:43:59] thx :) [20:44:30] (03PS2) 10ArielGlenn: sarin as codfw redundant salt master [puppet] - 10https://gerrit.wikimedia.org/r/269209 [20:44:36] (03PS3) 10Dzahn: Zotero: Move logs to /srv/log [puppet] - 10https://gerrit.wikimedia.org/r/268847 (https://phabricator.wikimedia.org/T107900) (owner: 10Mobrovac) [20:46:27] (03CR) 10Dzahn: [C: 032] Zotero: Move logs to /srv/log [puppet] - 10https://gerrit.wikimedia.org/r/268847 
(https://phabricator.wikimedia.org/T107900) (owner: 10Mobrovac) [20:46:45] (03PS3) 10ArielGlenn: sarin as codfw redundant salt master [puppet] - 10https://gerrit.wikimedia.org/r/269209 [20:48:30] (03CR) 10ArielGlenn: [C: 04-1] "actually don't merge this til I have the salt master key in the private repo" [puppet] - 10https://gerrit.wikimedia.org/r/269209 (owner: 10ArielGlenn) [20:48:36] (03PS1) 10Hashar: nodepool: rotate daily at midnight [puppet] - 10https://gerrit.wikimedia.org/r/269213 [20:49:12] 6operations: Install php5-readline on trusty and jessie hosts so eval.php, sql.php, and so on are more useful - https://phabricator.wikimedia.org/T126262#2009551 (10Anomie) [20:50:23] (03CR) 10Hashar: "From labnodepool1001.eqiad.wmnet, the files are rotated around 14:00 UTC which is in the middle of the image creation. As a result the log" [puppet] - 10https://gerrit.wikimedia.org/r/269213 (owner: 10Hashar) [20:51:05] (03CR) 10Hashar: "Sole host impacted is labnodepool1001.eqiad.wmnet . Maybe nodepool will have to be restarted which I can handle (got sudo)" [puppet] - 10https://gerrit.wikimedia.org/r/269213 (owner: 10Hashar) [20:53:53] 6operations, 7Monitoring: [RFC] Alert about *when* partitions will run out of space, not a percentage/absolute number - https://phabricator.wikimedia.org/T126158#2009570 (10Dzahn) The experience with icinga-graphite checks to me has been "always sounds great in theory but in practice it reports something like... [20:54:36] (03PS1) 10Ottomata: Revert "Update AQS config with new syntax" [puppet] - 10https://gerrit.wikimedia.org/r/269216 [20:55:20] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#2009588 (10bd808) [21:00:04] gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160208T2100). 
Please do the needful. [21:00:51] alright .. deploy time. [21:02:03] !log resuming rolling reboots of cpNNNN caches for kernel updates [21:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:03:07] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [21:03:56] 6operations, 7Mail: delete exim alias wikilibrary@ library@ - https://phabricator.wikimedia.org/T123666#2009623 (10Dzahn) created a ticket in Zendesk for this [21:04:07] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [21:04:24] * apergos is here, subbu [21:04:25] !log starting parsoid deploy [21:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:04:58] * subbu might need apergos to kill -9 some stuck workers .. hopefully for the last time. [21:05:11] okey dokey [21:07:18] (03PS1) 10Ori.livneh: scap: delete updateinterwikicache script [puppet] - 10https://gerrit.wikimedia.org/r/269222 [21:07:57] (03PS2) 10Ori.livneh: scap: delete updateinterwikicache script [puppet] - 10https://gerrit.wikimedia.org/r/269222 [21:09:13] (03CR) 10Ori.livneh: [C: 032 V: 032] "Will delete leftover file manually." [puppet] - 10https://gerrit.wikimedia.org/r/269222 (owner: 10Ori.livneh) [21:10:17] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [21:10:37] (03PS1) 10Krinkle: cache: Change static_host from www.wikimedia.org to en.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/269224 [21:10:39] !log synced code; restarted parsoid on wtp1003 as a canary [21:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:11:08] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. 
[21:11:32] 6operations, 6Parsing-Team, 10hardware-requests: Dedicated server for running Parsoid's roundtrip tests to get reliable parse latencies and use as perf. benchmarking tests - https://phabricator.wikimedia.org/T116090#2009631 (10RobH) Isn't this now going to be handled by bare metal in labs? Please advise. [21:12:36] (03PS3) 10Krinkle: cache: Normalise hostname for /w/skins,resources,extensions [puppet] - 10https://gerrit.wikimedia.org/r/269149 (https://phabricator.wikimedia.org/T99096) [21:12:56] 6operations, 10DBA, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: prepare for mariadb 10.0 masters - https://phabricator.wikimedia.org/T105135#2009637 (10RobH) [21:12:58] 6operations, 10hardware-requests: dbproxy servers for codfw - https://phabricator.wikimedia.org/T109116#2009636 (10RobH) 5stalled>3declined [21:13:01] (03CR) 10Krinkle: "Simplified to re-use the existing static_hostname config." [puppet] - 10https://gerrit.wikimedia.org/r/269149 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [21:13:13] wtp1003 looking good. will restart parsoid on all nodes after waiting for another minute or so. 
[21:13:18] ok [21:15:40] (03CR) 10Ori.livneh: [C: 032] "* updateinterwikicache got removed in https://gerrit.wikimedia.org/r/#/c/269201/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269201 (https://phabricator.wikimedia.org/T122362) (owner: 10Ori.livneh) [21:18:48] (03CR) 10Alex Monk: "link error, but I found the commit removing updateinterwikicache: https://gerrit.wikimedia.org/r/#/c/269222/2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269201 (https://phabricator.wikimedia.org/T122362) (owner: 10Ori.livneh) [21:19:07] ori, thanks for cleaning this up [21:19:13] I really dislike CDBs [21:19:19] thanks for reviewing [21:19:25] !log finished deploying parsoid version 4d44fcc7 [21:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:19:28] and yeah, me too [21:20:05] apergos, so, no stuck processes ... restart done. [21:20:30] suppose I need to update the wiki creation process again [21:21:15] where is that documented? [21:21:22] wikitech, [[Add a wiki]] [21:21:30] wow that's great [21:21:53] subbu: so let's wait 5 mins (so there's some queries going on maybe a long one or two) and try the restart [21:21:56] ok? [21:21:56] jouncebot: refresh [21:21:57] I refreshed my knowledge about deployments. [21:22:00] most of the times, there aren't stuck processes .. so, last time's expereince was a bit unusual. [21:22:08] heh [21:22:09] apergos, sure. [21:22:19] I guess you restarted with the bash script right? [21:22:20] !log ori@mira Synchronized wmf-config/interwiki.php: Ib599f9984a: Add interwiki.php; use it on mw1017 & on labs (1/2) (duration: 01m 20s) [21:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:22:36] yes. [21:22:43] so how long does it take before parsoid will give up and shoot processes, i.e. is it 3 minutes, 5? [21:22:48] I know there's some time limit [21:22:52] for the restart [21:22:53] we have a cpu timeout of 3 mins. 
[21:23:07] uh that's 3 realtime minutes or 3 cpu minutes? [21:23:22] how many minutes do we need on the command line? [21:23:24] used to be 5 min .. we brought it down to 4 and then 3 .. 3 real time. [21:23:27] greg-g: I rescheduled my deploy window for 22:30Z. Hopefully this time zuul will cooperate [21:23:27] ok [21:23:28] so [21:23:37] bd808: godspeed [21:23:37] !log ori@mira Synchronized wmf-config/CommonSettings.php: Ib599f9984a: Add interwiki.php; use it on mw1017 & on labs (2/2) (duration: 01m 16s) [21:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:23:46] you'l run the usual git deploy command but add --timeout 185 I guess [21:24:00] to give enough time for the restarts to finish even with that [21:24:09] so git deploy service restart --timeout 185 [21:24:12] yeah [21:24:33] if they don't hear back from the minions by then (if the master doesn't hear back from them all) it will time out [21:24:41] and give you that error you saw [21:24:53] ok. [21:24:56] PROBLEM - Host cp3039 is DOWN: PING CRITICAL - Packet loss = 100% [21:24:59] And I guess it's too late for this to be in wmf.13 too [21:25:06] when you run it, it would be cool if you run date before and after [21:25:15] so we know how long it sat and waited too [21:25:15] sure. [21:25:18] ok [21:25:19] Which means I'll need to backport to be able to do the planned wiki creation on thursday [21:25:45] apergos, usage: deploy ... [21:25:45] deploy: error: unrecognized arguments: --timeout 185 [21:25:57] hm [21:26:00] second [21:26:06] (03PS1) 10Ori.livneh: Enable static PHP interwiki cache on mediawikiwiki and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269294 [21:26:08] hm, actually wmf.13 branch is tomorrow. today is monday [21:26:12] 6operations, 7Mail: move grants aliases to OIT? 
- https://phabricator.wikimedia.org/T83791#2009702 (10Dzahn) most of those have already been removed meanwhile it seems, just these are left: ## Grants ## grant: grants grants: awang, jtud, kharold grantsadmin: jtud i mailed Zendesk to... [21:26:15] maybe it'll be merged before then [21:26:47] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3039_v4, cp3039_v6 [21:26:47] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 138 not-conn: cp3039_v4, cp3039_v6 [21:26:47] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 138 not-conn: cp3039_v4, cp3039_v6 [21:27:02] :P at ipsec [21:27:07] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3039_v4, cp3039_v6 [21:27:08] PROBLEM - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3039_v4, cp3039_v6 [21:27:08] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3039_v4, cp3039_v6 [21:27:09] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3039_v4, cp3039_v6 [21:27:09] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3039_v4, cp3039_v6 [21:27:11] (03CR) 10Ori.livneh: [C: 032] Enable static PHP interwiki cache on mediawikiwiki and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269294 (owner: 10Ori.livneh) [21:27:18] (that's me, another cache doesn't want to reboot itself) [21:27:26] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3039_v4, cp3039_v6 [21:27:36] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3039_v4, cp3039_v6 [21:27:37] PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3039_v4, cp3039_v6 [21:27:37] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 136 not-conn: cp3039_v4, cp3039_v6, cp4010_v4, cp4010_v6 [21:27:37] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 56 
not-conn: cp3039_v4, cp3039_v6 [21:27:46] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 136 not-conn: cp3039_v4, cp3039_v6, cp4010_v4, cp4010_v6 [21:27:46] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3039_v4, cp3039_v6 [21:27:56] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3039_v4, cp3039_v6 [21:27:57] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 136 not-conn: cp3039_v4, cp3039_v6, cp4010_v4, cp4010_v6 [21:28:06] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3039_v4, cp3039_v6 [21:28:07] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 136 not-conn: cp3039_v4, cp3039_v6, cp4010_v4, cp4010_v6 [21:28:24] !log Restarting HHVM on mw1017 to wipe APC cache [21:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:29:39] 6operations, 7Mail, 7Mobile, 5Patch-For-Review: consolidate mailman redirects in exim aliases file - https://phabricator.wikimedia.org/T123581#2009722 (10Dzahn) >>! In T123581#1983651, @Peachey88 wrote: > Yes, it can be done on gApps via the groups system. cool, thank you! i made a zendesk ticket for this [21:31:46] !log ori@mira Synchronized wmf-config/CommonSettings.php: I39c9ecd4b: Enable static PHP interwiki cache on mediawikiwiki and testwiki (duration: 01m 18s) [21:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:33:51] still here, trying to track down the issue [21:35:59] np .. i'll be afk for a little bit [21:39:47] 6operations, 6Parsing-Team, 10hardware-requests: Dedicated server for running Parsoid's roundtrip tests to get reliable parse latencies and use as perf. benchmarking tests - https://phabricator.wikimedia.org/T116090#2009801 (10Dzahn) I think it needs @Andrew to give access to the labs-test machine to subbu. 
[21:41:29] subbu: bah, I can't find it in 5 minutes so we'll try again next time [21:41:33] sorry about the mixup [21:42:04] (03PS2) 10Dzahn: wikitech: wikitech.m.wikimedia.org -> CNAME silver [dns] - 10https://gerrit.wikimedia.org/r/268734 (https://phabricator.wikimedia.org/T120527) [21:44:02] apergos, sounds good. [21:44:21] thanks for letting me waste a bit of your time :-) [21:45:35] (03PS1) 10Ori.livneh: Use interwiki.php on all wikis; delete unused interwiki.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269297 (https://phabricator.wikimedia.org/T122362) [21:45:48] 6operations, 6Parsing-Team, 10hardware-requests: Dedicated server for running Parsoid's roundtrip tests to get reliable parse latencies and use as perf. benchmarking tests - https://phabricator.wikimedia.org/T116090#2009818 (10ssastry) I am meeting with @Andrew tomorrow to discuss this and related matters ..... [21:46:29] apergos, :-) [21:46:58] (03CR) 10Ori.livneh: [C: 032] Use interwiki.php on all wikis; delete unused interwiki.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269297 (https://phabricator.wikimedia.org/T122362) (owner: 10Ori.livneh) [21:47:38] (03Merged) 10jenkins-bot: Use interwiki.php on all wikis; delete unused interwiki.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269297 (https://phabricator.wikimedia.org/T122362) (owner: 10Ori.livneh) [21:47:47] PROBLEM - puppet last run on achernar is CRITICAL: CRITICAL: puppet fail [21:51:40] !log ori@mira Synchronized wmf-config/CommonSettings.php: Ie9bdd77fb: Use interwiki.php on all wikis; delete unused interwiki.json (duration: 01m 19s) [21:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:51:46] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 58 ESP OK [21:51:46] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 140 ESP OK [21:51:47] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 140 ESP OK [21:51:47] RECOVERY - Host cp3039 is UP: PING OK - Packet 
loss = 0%, RTA = 85.79 ms [21:51:57] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 58 ESP OK [21:51:58] RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 58 ESP OK [21:52:06] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 58 ESP OK [21:52:07] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 58 ESP OK [21:52:07] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 58 ESP OK [21:52:17] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 58 ESP OK [21:52:26] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 58 ESP OK [21:52:26] RECOVERY - IPsec on cp1061 is OK: Strongswan OK - 58 ESP OK [21:52:27] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 140 ESP OK [21:52:27] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 58 ESP OK [21:52:36] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 140 ESP OK [21:52:36] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 58 ESP OK [21:52:47] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 58 ESP OK [21:52:47] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 140 ESP OK [21:52:56] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 58 ESP OK [21:52:57] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 140 ESP OK [21:54:56] RECOVERY - Host cp3032 is UP: PING WARNING - Packet loss = 58%, RTA = 86.04 ms [21:55:14] (03PS1) 10Ori.livneh: Update missing.php for interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269301 [21:55:43] (03CR) 10jenkins-bot: [V: 04-1] Update missing.php for interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269301 (owner: 10Ori.livneh) [21:55:53] (03PS1) 10BBlack: Revert "disable ipsec for cp3032, dead" [puppet] - 10https://gerrit.wikimedia.org/r/269302 [21:56:06] (03CR) 10BBlack: [C: 032 V: 032] Revert "disable ipsec for cp3032, dead" [puppet] - 10https://gerrit.wikimedia.org/r/269302 (owner: 10BBlack) [21:56:11] (03PS9) 10Ottomata: Increase length of lag window to 100 [puppet] - 10https://gerrit.wikimedia.org/r/268594 (https://phabricator.wikimedia.org/T125916) 
(owner: 10Nuria) [21:56:55] (03CR) 10Ottomata: [C: 032 V: 032] Increase length of lag window to 100 [puppet] - 10https://gerrit.wikimedia.org/r/268594 (https://phabricator.wikimedia.org/T125916) (owner: 10Nuria) [21:57:01] (03PS2) 10Ori.livneh: Update missing.php for interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269301 [21:57:12] oops bblack [21:57:14] i just merged that [21:57:25] puppet-merged [21:57:36] + - 'cp3032.esams.wmnet' [21:57:44] (03CR) 10Ori.livneh: [C: 032] Update missing.php for interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269301 (owner: 10Ori.livneh) [21:57:49] bblack, is ok? [21:58:32] (03Merged) 10jenkins-bot: Update missing.php for interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269301 (owner: 10Ori.livneh) [21:58:41] ottomata: yes [21:59:05] ok phew [21:59:57] PROBLEM - Freshness of OCSP Stapling files on cp3032 is CRITICAL: CRITICAL: File /var/cache/ocsp/unified.ocsp is more than 29100 secs old! [22:01:36] (03CR) 10Dzahn: "need to make sure this is in Apache config on silver. is it?" 
[dns] - 10https://gerrit.wikimedia.org/r/268734 (https://phabricator.wikimedia.org/T120527) (owner: 10Dzahn) [22:01:50] (03PS1) 10Ori.livneh: createTxtFileSymlinks.sh: drop interwiki.cdb; add interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269303 [22:05:22] (03PS1) 10Nuria: Correcting typo in hiera for burrow.yaml [puppet] - 10https://gerrit.wikimedia.org/r/269304 [22:06:42] (03CR) 10Alex Monk: "Just has "ServerName wikitech.wikimedia.org" and "ServerAlias wmflabs.org www.wmflabs.org"" [dns] - 10https://gerrit.wikimedia.org/r/268734 (https://phabricator.wikimedia.org/T120527) (owner: 10Dzahn) [22:07:50] (03PS2) 10Nuria: Correcting typo in hiera for burrow.yaml [puppet] - 10https://gerrit.wikimedia.org/r/269304 [22:07:59] 6operations, 10ops-esams, 10Traffic: cp30[34]x hw/firmware/BMC issues - https://phabricator.wikimedia.org/T126062#2009923 (10BBlack) [22:08:00] ottomata: done [22:08:51] (03CR) 10Ottomata: [C: 032 V: 032] Correcting typo in hiera for burrow.yaml [puppet] - 10https://gerrit.wikimedia.org/r/269304 (owner: 10Nuria) [22:08:54] meh, those diffusion links in gerrit are so useless [22:09:22] maybe I should figure out what changed them and try to get it reverted [22:10:38] RECOVERY - Freshness of OCSP Stapling files on cp3032 is OK: OK [22:11:16] PROBLEM - Hadoop HistoryServer on analytics1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer [22:12:07] PROBLEM - traffic-pool service on cp3032 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive [22:12:22] (03CR) 10Alex Monk: "Please run the script and commit the changes" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269303 (owner: 10Ori.livneh) [22:12:57] RECOVERY - Hadoop HistoryServer on analytics1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer [22:13:04] 6operations, 10Deployment-Systems, 10Salt, 5Patch-For-Review: 
[Trebuchet] Salt times out on parsoid restarts - https://phabricator.wikimedia.org/T63882#2009937 (10ArielGlenn) hm, testing failed for want of an argument to the git deploy restart code (trigger). and service-restart no longer gets deployed.... [22:13:56] RECOVERY - traffic-pool service on cp3032 is OK: OK - traffic-pool is active [22:14:27] RECOVERY - puppet last run on achernar is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [22:21:40] (03CR) 10Papaul: [C: 031] netboot.cfg - replace tabs with spaces [puppet] - 10https://gerrit.wikimedia.org/r/268717 (owner: 10Dzahn) [22:25:12] Looks like https://stats.wikipedia.org/ no longer redirects [22:25:19] showing a Varnish error: Domain not served here instead [22:25:45] Krinkle: stats.wikimedia.org [22:26:13] mutante: Yes, that was the redirect destination [22:26:30] 6operations, 10ops-eqiad, 5Patch-For-Review: SMART errors on kafka1012.eqiad.wmfnet - https://phabricator.wikimedia.org/T125199#2009992 (10Cmjohnson) @ottomata did this recover completely? [22:27:08] 6operations, 6Services: SCA: Move logs to /srv/ - https://phabricator.wikimedia.org/T107900#2009993 (10mobrovac) 5Open>3Resolved [22:28:28] mutante: stats.wikimedia.org.conf contains a redirect [22:30:04] bd808: Respected human, time to deploy Debug logging improvements (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160208T2230). Please do the needful. 
[22:30:53] (03PS1) 10Subramanya Sastry: parsoid-vd-client: Set screenShotDelay to 5 seconds [puppet] - 10https://gerrit.wikimedia.org/r/269314 [22:31:00] 7Blocked-on-Operations, 10Analytics-Wikistats, 7Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281#2010000 (10Krinkle) 3NEW [22:31:20] 6operations, 10Analytics-Wikistats, 7Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281#2010007 (10Krinkle) [22:33:29] * bd808 will try to deploy again [22:33:59] (03CR) 10Ori.livneh: [C: 031] cache: Change static_host from www.wikimedia.org to en.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/269224 (owner: 10Krinkle) [22:34:42] (03CR) 10Ori.livneh: [C: 032] createTxtFileSymlinks.sh: drop interwiki.cdb; add interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269303 (owner: 10Ori.livneh) [22:35:07] (03Merged) 10jenkins-bot: createTxtFileSymlinks.sh: drop interwiki.cdb; add interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269303 (owner: 10Ori.livneh) [22:35:36] (03CR) 10BBlack: "We'll have to deploy this a little slowly too, so we don't effectively wipe them all in one go. Should be fine to start with a lower-traf" [puppet] - 10https://gerrit.wikimedia.org/r/269224 (owner: 10Krinkle) [22:35:41] 6operations, 10ops-eqiad, 5Patch-For-Review: SMART errors on kafka1012.eqiad.wmfnet - https://phabricator.wikimedia.org/T125199#2010019 (10Ottomata) 5Open>3Resolved Yes, looking good! Thank you! [22:37:34] (03PS1) 10Ori.livneh: Rebuild interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269315 [22:37:50] ^ Krenair [22:40:01] (03PS1) 10Ori.livneh: Remove interwiki.cdb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269318 (https://phabricator.wikimedia.org/T122362) [22:40:48] ori, nice [22:41:21] bd808: still deploying? 
[22:41:28] I think security-wise it's fine as we'll definitely be reviewing these [22:41:48] ori: yes. been stuck in a meeting that ran long [22:41:55] np, just wondering [22:42:14] * bd808 rungs sync-file [22:42:17] *runs [22:42:40] "Warning: fopen() expects parameter 1 to be string, array given in /srv/m [22:42:40] ediawiki/multiversion/vendor/wikimedia/cdb/src/Reader/PHP.php on line 69" [22:43:05] looks like that may be a leftover from earlier? [22:43:21] !log bd808@mira Synchronized php-1.27.0-wmf.12/includes/debug/logger/monolog/WikiProcessor.php: Add $wgVersion to MediaWiki\Logger\Monolog\WikiProcessor (3cea726) (duration: 01m 19s) [22:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:43:27] hrm -- do you have the caller? [22:43:36] no, it's getting new hit from somewhere but slowly [22:43:52] like 1 new log event a minute or something slow like that [22:44:17] oh, that's probably missing.php, which I haven't synced yet [22:44:21] ori, oh, did you not see my comment on https://gerrit.wikimedia.org/r/#/c/269303/ ? [22:44:24] ori: I'm just seeing it on the top of the fluorine fatalmonitor [22:44:50] Krenair: no, missed that. Will do so now [22:44:54] k [22:44:56] ori: do you want to hop in and sync that if it's ready? [22:45:06] (03CR) 10Subramanya Sastry: [C: 04-1] "Testing this with default after the visualdiff bug is fixed. Will push this patch if necessary." [puppet] - 10https://gerrit.wikimedia.org/r/269314 (owner: 10Subramanya Sastry) [22:45:09] bd808: sure, if you don't mind [22:45:27] yeah no problem. 
I'm going to check the mwversion thing I just synced now [22:46:28] (03PS1) 10Ori.livneh: Update symlinks following Ifd7fe8c3c [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269324 [22:46:39] (03CR) 10Ori.livneh: [C: 032] Remove interwiki.cdb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269318 (https://phabricator.wikimedia.org/T122362) (owner: 10Ori.livneh) [22:48:01] (03CR) 10Ori.livneh: [C: 032] Update symlinks following Ifd7fe8c3c [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269324 (owner: 10Ori.livneh) [22:49:14] (03CR) 10Papaul: "on linux-host-entries.ttyS1-115200 lines" [puppet] - 10https://gerrit.wikimedia.org/r/268706 (owner: 10Dzahn) [22:49:24] ori: those cdb reader errors are actually pretty hot right now -- https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor [22:49:25] (03CR) 10Ori.livneh: [V: 032] Update symlinks following Ifd7fe8c3c [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269324 (owner: 10Ori.livneh) [22:49:49] they're for missing.php; it'll be fine in a min [22:50:56] (03PS3) 10BryanDavis: Monolog: normalize messages before PSR3 expansion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269063 (https://phabricator.wikimedia.org/T124985) [22:51:33] !log ori@mira Synchronized docroot and w: Ifd7fe8c3c: createTxtFileSymlinks.sh: drop interwiki.cdb; add interwiki.php (duration: 01m 21s) [22:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:52:03] sync-masters is slow [22:52:10] yeah it is [22:52:21] !log ori@mira Synchronized wmf-config/missing.php: Ib5407c560: Update missing.php for interwiki.php (duration: 01m 18s) [22:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:52:34] it was fast when /srv/mediawiki-staging was empty ;) [22:53:01] bd808: done [22:53:06] cool. 
[22:53:12] thanks again [22:53:18] np [22:55:37] ori: that fopen() warning doesn't seem to be going away [22:57:35] it started with the interwiki.php change at 14:51 [22:58:29] (03CR) 10BryanDavis: [C: 032] Monolog: normalize messages before PSR3 expansion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269063 (https://phabricator.wikimedia.org/T124985) (owner: 10BryanDavis) [22:59:19] 10Ops-Access-Requests, 6operations, 6Services: Requesting restbase-admins access to RESTBase cluster for Petr Pchelko - https://phabricator.wikimedia.org/T126283#2010055 (10GWicke) @pchelolo, could you add a public SSH key to use for this? Per [1](https://wikitech.wikimedia.org/wiki/Requesting_shell_access),... [22:59:30] ori: is the dirty wmf-config/interwiki.cdb on mira expected? [23:01:00] (03Merged) 10jenkins-bot: Monolog: normalize messages before PSR3 expansion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269063 (https://phabricator.wikimedia.org/T124985) (owner: 10BryanDavis) [23:01:38] (03CR) 10Aaron Schulz: "Yes" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266609 (https://phabricator.wikimedia.org/T91869) (owner: 10Aaron Schulz) [23:02:38] 10Ops-Access-Requests, 6operations, 6Services: Requesting restbase-admins access to RESTBase cluster for Petr Pchelko - https://phabricator.wikimedia.org/T126283#2010062 (10Pchelolo) ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDSxkE+b4Jc+3FoCgYqZvQJZ8a0Hk2UhC2Qb1zi1CiThsE8oBPf6n1Mki58o/mHBrtfgAPutCFFylkLwuPDE5tDoj... 
[23:04:59] !log bd808@mira Synchronized wmf-config/logging.php: Monolog: normalize messages before PSR3 expansion (e5ee5d8) (duration: 01m 18s) [23:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:06:28] 10Ops-Access-Requests, 6operations, 6Services: Requesting restbase-admins access to RESTBase cluster for Petr Pchelko - https://phabricator.wikimedia.org/T126283#2010069 (10mobrovac) @Pchelolo you also need to read and sign {L3}/ [23:07:28] (03PS2) 10BryanDavis: Monolog: Add mwversion to udp2log log events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269065 (https://phabricator.wikimedia.org/T125707) [23:07:38] (03CR) 10BryanDavis: [C: 032] Monolog: Add mwversion to udp2log log events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269065 (https://phabricator.wikimedia.org/T125707) (owner: 10BryanDavis) [23:14:02] (03Merged) 10jenkins-bot: Monolog: Add mwversion to udp2log log events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269065 (https://phabricator.wikimedia.org/T125707) (owner: 10BryanDavis) [23:16:14] !log bd808@mira Synchronized wmf-config/logging.php: Monolog: Add mwversion to udp2log log events (9b54967) (duration: 01m 18s) [23:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:17:05] (03PS2) 10BryanDavis: logging: Send all udp2log eligible messages to $wmgDefaultMonologHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269068 (https://phabricator.wikimedia.org/T117019) [23:17:13] (03CR) 10BryanDavis: [C: 032] logging: Send all udp2log eligible messages to $wmgDefaultMonologHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269068 (https://phabricator.wikimedia.org/T117019) (owner: 10BryanDavis) [23:18:20] (03Merged) 10jenkins-bot: logging: Send all udp2log eligible messages to $wmgDefaultMonologHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269068 (https://phabricator.wikimedia.org/T117019) (owner: 
10BryanDavis) [23:19:37] dr0ptp4kt: it would be great if you can take a look at https://phabricator.wikimedia.org/T106064 [23:20:18] !log bd808@mira Synchronized wmf-config/logging.php: logging: Send all udp2log eligible messages to $wmgDefaultMonologHandler (cd25586) (duration: 01m 17s) [23:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:21:13] (03PS3) 10BryanDavis: logging: Collect mw1017 logs for debugging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269069 (https://phabricator.wikimedia.org/T117020) [23:21:20] (03CR) 10BryanDavis: [C: 032] logging: Collect mw1017 logs for debugging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269069 (https://phabricator.wikimedia.org/T117020) (owner: 10BryanDavis) [23:21:51] (03Merged) 10jenkins-bot: logging: Collect mw1017 logs for debugging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269069 (https://phabricator.wikimedia.org/T117020) (owner: 10BryanDavis) [23:22:27] mutante: i'm going to schedule a meeting with you two so that i can understand better [23:22:42] dr0ptp4kt: eh, i dont think that's needed [23:23:42] dr0ptp4kt: i think he can also just sign the document [23:23:49] (03PS1) 10Legoktm: zuul-test-repo: Allow testing multiple repositories at once [puppet] - 10https://gerrit.wikimedia.org/r/269328 [23:24:00] !log bd808@mira Synchronized wmf-config/logging.php: logging: Collect mw1017 logs for debugging (9d6d0e0) (duration: 01m 18s) [23:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:24:25] mutante: ok, if it's that simple from a procedural standpoint, i trust you. 
i'm going to schedule the meeting nonetheless [23:24:28] 6operations, 7HHVM, 7Tracking: Complete the use of HHVM over Zend PHP on the Wikimedia cluster (tracking) - https://phabricator.wikimedia.org/T86081#2010135 (10hashar) This task is a blocker for getting rid of Zend 5.3 Jenkins jobs for WMF branches and master branches deployed on Wikimedia infra ( T94149 ).... [23:25:20] (03CR) 10jenkins-bot: [V: 04-1] zuul-test-repo: Allow testing multiple repositories at once [puppet] - 10https://gerrit.wikimedia.org/r/269328 (owner: 10Legoktm) [23:26:22] ori: I'm done on mira if you've got that fix ready [23:26:41] bd808: yep; are you up for reviewing it? https://gerrit.wikimedia.org/r/#/c/269329/ [23:27:02] yeah I just opened the prior patch to compare [23:27:20] (03PS2) 10Legoktm: zuul-test-repo: Allow testing multiple repositories at once [puppet] - 10https://gerrit.wikimedia.org/r/269328 [23:28:42] ori: :shipit: [23:28:52] 10Ops-Access-Requests, 6operations, 6Services: Requesting restbase-admins access to RESTBase cluster for Petr Pchelko - https://phabricator.wikimedia.org/T126283#2010145 (10Pchelolo) @mobrovac read and signed. [23:29:55] bd808: thanks, bd808 (for noticing/reporting the bug and reviewing the fix) [23:30:52] !log ori@mira Synchronized php-1.27.0-wmf.12/includes/interwiki/Interwiki.php: ac6e170fa5: Fix-up for I5a979f047031e (duration: 01m 18s) [23:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:31:17] (03CR) 10Ori.livneh: [C: 032] Rebuild interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269315 (owner: 10Ori.livneh) [23:32:14] ori: you might want to cleanup wmf-config/interwiki.cdb while you are there. 
I stashed it while I did things and then popped the stash [23:32:23] (03Merged) 10jenkins-bot: Rebuild interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269315 (owner: 10Ori.livneh) [23:32:31] bd808: it gets removed in a commit that should be merged in a sec [23:32:37] perfect [23:38:48] PROBLEM - cassandra-a service on praseodymium is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [23:39:16] PROBLEM - cassandra-a CQL 10.64.16.188:9042 on praseodymium is CRITICAL: Connection refused [23:43:27] PROBLEM - restbase endpoints health on cerium is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /page/html/{title} (Get html by title from storage) is CRITICAL: Test Get htm [23:43:38] PROBLEM - restbase endpoints health on xenon is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html [23:43:46] PROBLEM - cassandra-a service on cerium is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [23:44:07] PROBLEM - restbase endpoints health on praseodymium is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /page/html/{title} (Get html by title from storage) is CRITICAL: Test G 
[23:44:47] PROBLEM - cassandra-a CQL 10.64.16.153:9042 on cerium is CRITICAL: Connection refused [23:45:25] this is restbase staging ^^ [23:45:48] RECOVERY - cassandra-a service on praseodymium is OK: OK - cassandra-a is active [23:45:55] we are load testing brotli compression [23:47:09] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy [23:47:16] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [23:47:38] RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy [23:47:57] RECOVERY - cassandra-a CQL 10.64.16.188:9042 on praseodymium is OK: TCP OK - 0.085 second response time on port 9042 [23:50:54] Who will be SWATing? And are we going to be able to actually do it? [23:51:07] PROBLEM - puppet last run on mw2092 is CRITICAL: CRITICAL: puppet fail [23:52:27] RECOVERY - cassandra-a service on cerium is OK: OK - cassandra-a is active [23:52:35] 6operations, 7Mail: delete exim alias wikilibrary@ library@ - https://phabricator.wikimedia.org/T123666#2010214 (10Dzahn) Eliza said in http://wmf.zendesk.com/requests/10079 that this has already been done [23:52:56] PROBLEM - restbase endpoints health on praseodymium is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /page/html/{title} (Get html by title from storage) is CRITICAL: Test G [23:53:05] We're down? [23:53:28] RECOVERY - cassandra-a CQL 10.64.16.153:9042 on cerium is OK: TCP OK - 0.000 second response time on port 9042 [23:54:07] hoo: it's a case of "just staging" [23:54:16] PROBLEM - cassandra-a service on xenon is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [23:54:21] mutante: Really? 
[23:54:33] this is staging, and we are load testing [23:54:35] hoo: what makes you think otherwise? [23:54:42] mutante: Oh, hit an unrelated fatal [23:54:44] nvm [23:54:50] ok [23:55:17] PROBLEM - cassandra-a CQL 10.64.0.202:9042 on xenon is CRITICAL: Connection refused [23:55:23] (03Abandoned) 10Subramanya Sastry: parsoid-vd-client: Set screenShotDelay to 5 seconds [puppet] - 10https://gerrit.wikimedia.org/r/269314 (owner: 10Subramanya Sastry) [23:57:46] RECOVERY - cassandra-a service on xenon is OK: OK - cassandra-a is active [23:58:07] PROBLEM - cassandra-a service on praseodymium is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [23:58:16] RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy [23:58:21] tgr: Around? [23:58:27] PROBLEM - cassandra-a CQL 10.64.16.188:9042 on praseodymium is CRITICAL: Connection refused [23:58:35] hoo: yes [23:58:45] tgr: https://gerrit.wikimedia.org/r/#/c/268859/1 needs to be backported [23:58:47] RECOVERY - cassandra-a CQL 10.64.0.202:9042 on xenon is OK: TCP OK - 0.000 second response time on port 9042 [23:58:57] We can't manage oauth consumers right now [23:59:24] it's scheduled for swat [23:59:29] ah, ok [23:59:34] It didn't say that on the ticket [23:59:38] nevermind, then [23:59:47] RECOVERY - cassandra-a service on praseodymium is OK: OK - cassandra-a is active