[00:14:43] RECOVERY - puppet last run on mw1241 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [01:02:36] (03PS1) 10Dereckson: Disable centralauth-rename right for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322606 (https://phabricator.wikimedia.org/T148242) [01:03:21] ^ This has been identified by MarcoAurelio as an Unbreak now! as it could disrupt matt_flaschen's maintenance script to populate local_user_id and global_user_id fields [01:04:14] But, matt didn't request that before to run the script, only to warn stewards not to rename. So the UBN! would benefit of more comments. [01:13:35] stashbot: next [01:13:35] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. [01:13:46] jouncebot: next [01:13:47] In 12 hour(s) and 46 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161121T1400) [01:14:28] Dereckson: ^ 12 hours to the next SWAT window. That would be the safest time to apply that change. [01:14:36] aww [01:14:54] Reedy: I'm a chicken :) [01:15:34] (03CR) 10BryanDavis: [C: 031] Disable centralauth-rename right for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322606 (https://phabricator.wikimedia.org/T148242) (owner: 10Dereckson) [01:17:29] Dereckson: Is it being used? [01:32:37] Yes, per https://phabricator.wikimedia.org/T148242#2808928 [01:33:29] Dereckson: No, I mean, are people actually actively causing problems? [01:35:12] Per https://phabricator.wikimedia.org/T148242#2804544 it seems to be a safety measure [01:35:49] The maintenance script could miss a renamed account. [01:36:05] So we just run it again? ;) [01:36:09] Right. [01:36:19] Per Bryan then.. If people aren't actively breaking it, I think it should wait for swat [01:38:03] Ok, I decreased the task priority from UBN to high. 
[01:38:25] Thanks [01:38:27] I think that's fair [01:38:37] Like I say, if people were actively renaming and causing problems, then sure, deploy [01:38:41] But I think it can wait 12 hours [02:01:41] !bash Testing [02:01:42] WMFLabs: Stored quip at https://tools.wmflabs.org/bash/quip/AViEnhJ3HQCSeVEJewvU [02:17:16] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.3) (duration: 05m 46s) [02:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:19:23] PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:21:35] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Nov 21 02:21:35 UTC 2016 (duration 4m 19s) [02:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:48:05] RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [03:18:06] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 1800.074254 Seconds [03:18:16] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 1811.508643 Seconds [03:19:06] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds [03:19:16] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 18.742979 Seconds [03:28:46] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 673.28 seconds [03:34:46] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 239.24 seconds [04:00:46] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 25 probes of 406 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map [04:05:46] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 3 probes of 406 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map [04:32:26] 
PROBLEM - puppet last run on notebook1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:33:03] (03CR) 10KartikMistry: [C: 031] "@Alex, This looks good to me. This should be deployed with cxserver/deploy patch." [puppet] - 10https://gerrit.wikimedia.org/r/321860 (https://phabricator.wikimedia.org/T147634) (owner: 10Mobrovac) [05:00:26] RECOVERY - puppet last run on notebook1002 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [05:19:36] PROBLEM - puppet last run on mw1179 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:20:46] PROBLEM - puppet last run on cp1065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:48:30] RECOVERY - puppet last run on mw1179 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [05:49:40] RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:08:49] (03CR) 10Legoktm: [C: 032] Disable centralauth-rename right for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322606 (https://phabricator.wikimedia.org/T148242) (owner: 10Dereckson) [06:09:21] (03Merged) 10jenkins-bot: Disable centralauth-rename right for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322606 (https://phabricator.wikimedia.org/T148242) (owner: 10Dereckson) [06:11:10] !log legoktm@tin Synchronized wmf-config/: Disable centralauth-rename right for maintenance (T148242, T151155) (duration: 00m 52s) [06:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:34] T148242: Fully populate local_user_id and global_user_id fields in production - https://phabricator.wikimedia.org/T148242 [06:11:34] T151155: Suspend centralauth-rename (global rename) rights until 28 November 2016 - https://phabricator.wikimedia.org/T151155 [06:33:24] PROBLEM - 
mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:36:24] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [07:07:32] 06Operations, 10MediaWiki-General-or-Unknown, 10Traffic: Failure to save recent changes - https://phabricator.wikimedia.org/T150503#2810044 (10Marostegui) @Marshallsumter can this be closed then? [07:13:14] PROBLEM - puppet last run on dbstore1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:14:35] (03PS1) 10Marostegui: db-codfw.php: Repool db2070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322613 (https://phabricator.wikimedia.org/T149553) [07:16:07] (03CR) 10Marostegui: [C: 032] db-codfw.php: Repool db2070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322613 (https://phabricator.wikimedia.org/T149553) (owner: 10Marostegui) [07:16:44] (03Merged) 10jenkins-bot: db-codfw.php: Repool db2070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322613 (https://phabricator.wikimedia.org/T149553) (owner: 10Marostegui) [07:18:13] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2070 - T149553 (duration: 00m 49s) [07:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:35] T149553: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553 [07:37:24] PROBLEM - puppet last run on mw1207 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:41:14] RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [07:46:34] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:01:28] (03CR) 10Marostegui: [C: 031] mariadb: Update package creation method [software] - 10https://gerrit.wikimedia.org/r/322147 (https://phabricator.wikimedia.org/T127811) (owner: 10Jcrespo) [08:06:25] RECOVERY - puppet last run on mw1207 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [08:08:28] !log Deploy ALTER table db1069 commonswiki.revision - https://phabricator.wikimedia.org/T147305 [08:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:44] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [08:15:19] (03PS1) 10Marostegui: db-eqiad.php: Depool db1068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322615 (https://phabricator.wikimedia.org/T147305) [08:20:19] ah here we go! marostegui doing an alter table :P [08:20:27] morning :D [08:20:37] elukey: I was waiting for you now :) [08:20:43] ahahaha [08:30:27] 06Operations, 10MediaWiki-General-or-Unknown, 10Traffic: Failure to save recent changes - https://phabricator.wikimedia.org/T150503#2810098 (10elukey) 05stalled>03Resolved a:03elukey So we found several things to follow up, probably not on this task: 1) nginx proxy_request_buffering (already deployed... 
[08:33:05] ^ \o/ [08:42:20] (03CR) 10Alexandros Kosiaris: [C: 032] icinga: Fix the typo in check_systemd_state [puppet] - 10https://gerrit.wikimedia.org/r/322288 (owner: 10Alexandros Kosiaris) [08:42:24] (03PS2) 10Alexandros Kosiaris: icinga: Fix the typo in check_systemd_state [puppet] - 10https://gerrit.wikimedia.org/r/322288 [08:42:26] (03CR) 10Alexandros Kosiaris: [V: 032] icinga: Fix the typo in check_systemd_state [puppet] - 10https://gerrit.wikimedia.org/r/322288 (owner: 10Alexandros Kosiaris) [08:45:59] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.220 second response time [08:46:59] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.351 second response time [08:48:29] PROBLEM - Host mr1-codfw.oob is DOWN: CRITICAL - Time to live exceeded (216.117.46.36) [08:49:52] 06Operations: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196#2810146 (10Legoktm) [08:50:24] !log rolling restart of hadoop-related java daemons on analytics* hosts due to openjdk update [08:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:39] RECOVERY - Host mr1-codfw.oob is UP: PING OK - Packet loss = 0%, RTA = 32.12 ms [08:56:09] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 24 probes of 429 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [08:57:08] PROBLEM - puppet last run on analytics1040 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[hadoop-hdfs-datanode] [08:58:25] checking --^, for some reason the hdfs daemon didn't restart due to address already in use (the old process was stuck for some reason) [08:58:35] (and this is a worker node, not the master) [08:58:51] puppet runs all good [08:58:58] RECOVERY - puppet last run on analytics1040 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [08:59:28] PROBLEM - puppet last run on conf1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:00:38] PROBLEM - Check systemd state on cp3006 is CRITICAL: CRITICAL - bdegraded: unexpected [09:01:08] PROBLEM - Check systemd state on cp2015 is CRITICAL: CRITICAL - bdegraded: unexpected [09:01:08] PROBLEM - Check systemd state on cp3042 is CRITICAL: CRITICAL - bdegraded: unexpected [09:01:08] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 1 probes of 429 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [09:01:38] PROBLEM - Check systemd state on cp2003 is CRITICAL: CRITICAL - bdegraded: unexpected [09:02:28] PROBLEM - Check systemd state on cp4019 is CRITICAL: CRITICAL - bdegraded: unexpected [09:27:25] RECOVERY - puppet last run on conf1001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [09:28:30] (03CR) 10Jcrespo: [C: 032 V: 032] mariadb: Update package creation method [software] - 10https://gerrit.wikimedia.org/r/322147 (https://phabricator.wikimedia.org/T127811) (owner: 10Jcrespo) [09:28:55] PROBLEM - Check systemd state on cp2009 is CRITICAL: CRITICAL - bdegraded: unexpected [09:28:56] PROBLEM - Check systemd state on cp3043 is CRITICAL: CRITICAL - bdegraded: unexpected [09:28:56] PROBLEM - Check systemd state on cp3032 is CRITICAL: CRITICAL - bdegraded: unexpected [09:29:15] PROBLEM - Check systemd state on cp4011 is CRITICAL: CRITICAL - bdegraded: unexpected [09:29:25] PROBLEM - Check systemd state on cp1059 is 
CRITICAL: CRITICAL - bdegraded: unexpected [09:29:45] PROBLEM - Check systemd state on cp1046 is CRITICAL: CRITICAL - bdegraded: unexpected [09:29:45] PROBLEM - Check systemd state on cp3033 is CRITICAL: CRITICAL - bdegraded: unexpected [09:29:45] PROBLEM - Check systemd state on cp3003 is CRITICAL: CRITICAL - bdegraded: unexpected [09:29:55] PROBLEM - Check systemd state on cp1051 is CRITICAL: CRITICAL - bdegraded: unexpected [09:30:15] PROBLEM - Check systemd state on cp1061 is CRITICAL: CRITICAL - bdegraded: unexpected [09:30:25] PROBLEM - Check systemd state on cp4012 is CRITICAL: CRITICAL - bdegraded: unexpected [09:30:55] PROBLEM - Check systemd state on cp3004 is CRITICAL: CRITICAL - bdegraded: unexpected [09:30:58] (03PS1) 10Jcrespo: dbtools: Drop labsdb1008 from the list of hosts [software] - 10https://gerrit.wikimedia.org/r/322619 [09:31:25] PROBLEM - Check systemd state on cp2021 is CRITICAL: CRITICAL - bdegraded: unexpected [09:31:55] PROBLEM - Check systemd state on cp4020 is CRITICAL: CRITICAL - bdegraded: unexpected [09:32:05] looking ^ [09:32:05] PROBLEM - Check systemd state on cp3005 is CRITICAL: CRITICAL - bdegraded: unexpected [09:32:05] PROBLEM - Check systemd state on labstore1002 is CRITICAL: CRITICAL - bdegraded: unexpected [09:32:09] (03CR) 10Marostegui: [C: 031] dbtools: Drop labsdb1008 from the list of hosts [software] - 10https://gerrit.wikimedia.org/r/322619 (owner: 10Jcrespo) [09:32:23] (03CR) 10jenkins-bot: [V: 04-1] dbtools: Drop labsdb1008 from the list of hosts [software] - 10https://gerrit.wikimedia.org/r/322619 (owner: 10Jcrespo) [09:32:25] PROBLEM - Check systemd state on cp1060 is CRITICAL: CRITICAL - bdegraded: unexpected [09:32:52] ema: looks like is varnishrls.service at least on one I looked, but the check is new, I think alex just activated it [09:34:05] volans: yeah [09:34:16] labstore1002 also seems to have failed units [09:35:41] akosiaris: for the ones that report unable to read output: /usr/bin/env: python3: No such 
file or directory [09:36:03] argh [09:36:05] also on some I've seen the file saved as .py, on others both (.py and without extension) [09:36:14] wat ? [09:36:53] ok, I 'll fallback the check to python 2 [09:36:55] on labstore1002 the failed services are cleanup-snapshots-labstore and replicate-{maps,others,tools} [09:37:14] blame paravoid btw :P [09:37:27] akosiaris: check cp2003: ls -la /usr/local/lib/nagios/plugins/check_systemd_state* [09:37:47] there are 2 identical [09:37:54] while on mw1220 there is only the .py version [09:38:13] (03CR) 10Jcrespo: [C: 032 V: 032] dbtools: Drop labsdb1008 from the list of hosts [software] - 10https://gerrit.wikimedia.org/r/322619 (owner: 10Jcrespo) [09:38:53] (03PS4) 10Jcrespo: Stagger parser cache purges to avoid lag [puppet] - 10https://gerrit.wikimedia.org/r/320928 (https://phabricator.wikimedia.org/T150124) (owner: 10Aaron Schulz) [09:39:47] (03PS1) 10Alexandros Kosiaris: Use python 2.7 for the check_systemd_state check [puppet] - 10https://gerrit.wikimedia.org/r/322620 [09:40:07] akosiaris: and as you can see for the failed ones it looks for b'degraded' instead of 'degraded' in the array and don't find it [09:40:44] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Use python 2.7 for the check_systemd_state check [puppet] - 10https://gerrit.wikimedia.org/r/322620 (owner: 10Alexandros Kosiaris) [09:41:03] let's start by falling back to 2.7 [09:41:13] strange though, require_package('python3') [09:41:17] is in the same file just above [09:41:23] modules/nrpe/manifests/systemd_scripts.pp [09:41:34] I remembered it from the code review [09:41:59] Error: /Stage[main]/Nrpe/File[/usr/local/lib/nagios/plugins/check_systemd_state.py]/content: change from {md5}32437841c3e99a778b5662a7ead0bdc4 to {md5}bfaed703b5cf1443ee6a0a90fa750531 failed: invalid byte sequence in US-ASCII [09:42:01] damn [09:42:16] damn copyright sign [09:42:41] ASCII really? 
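Note on the b'degraded' remark above: a minimal Python sketch of how such a mismatch can arise, not the actual check_systemd_state plugin. Under Python 3, output read from a subprocess is bytes, so comparing it against a list of str never matches. The "bdegraded: unexpected" alert text is consistent with that. The function names, the CRITICAL_STATES list, and the message format below are assumptions made for illustration only.

```python
# Hypothetical sketch; names and state list are illustrative assumptions.
import subprocess

# States this sketch treats as a known failure (assumed list).
CRITICAL_STATES = ['degraded', 'maintenance']


def normalize_state(raw):
    """Return the state as text, whether raw arrived as bytes or str."""
    if isinstance(raw, bytes):
        raw = raw.decode('utf-8', errors='replace')
    return raw.strip()


def classify(raw):
    """Map a raw state value to a Nagios-style (exit_code, message) pair."""
    state = normalize_state(raw)
    if state == 'running':
        return 0, 'OK - running'
    if state in CRITICAL_STATES:
        return 2, 'CRITICAL - ' + state
    return 2, 'CRITICAL - ' + state + ': unexpected'


def read_system_state():
    """Ask systemd for the overall state; stdout is returned as bytes."""
    proc = subprocess.Popen(['systemctl', 'is-system-running'],
                            stdout=subprocess.PIPE)
    out, _ = proc.communicate()
    return out
```

Calling classify(read_system_state()) on a systemd host would give the exit code and message. Without the decode step, the bytes value falls through to the "unexpected" branch on Python 3, while Python 2 treats the two types as equal, which may be why falling back to 2.7 hid the problem.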
[09:42:57] it's puppet, let's not go into details right now [09:43:47] (03CR) 10Jcrespo: [C: 032] Stagger parser cache purges to avoid lag [puppet] - 10https://gerrit.wikimedia.org/r/320928 (https://phabricator.wikimedia.org/T150124) (owner: 10Aaron Schulz) [09:43:50] (03PS1) 10Alexandros Kosiaris: Remove non ASCII character [puppet] - 10https://gerrit.wikimedia.org/r/322621 [09:43:53] (03PS5) 10Jcrespo: Stagger parser cache purges to avoid lag [puppet] - 10https://gerrit.wikimedia.org/r/320928 (https://phabricator.wikimedia.org/T150124) (owner: 10Aaron Schulz) [09:43:55] PROBLEM - puppet last run on cp2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/check_systemd_state.py] [09:44:17] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Remove non ASCII character [puppet] - 10https://gerrit.wikimedia.org/r/322621 (owner: 10Alexandros Kosiaris) [09:44:55] RECOVERY - Check systemd state on cp2009 is OK: OK - running: The system is fully operational [09:45:14] ah there we go... the first recovery :-) [09:45:25] (03PS6) 10Jcrespo: Stagger parser cache purges to avoid lag [puppet] - 10https://gerrit.wikimedia.org/r/320928 (https://phabricator.wikimedia.org/T150124) (owner: 10Aaron Schulz) [09:45:51] akosiaris: might be that in many hosts nrpe::systemd_scripts is not included? hence not even python3? 
I cannot find /usr/local/lib/nagios/plugins/check_systemd_state on mw1220 and python3 is not installed [09:45:55] RECOVERY - Check systemd state on cp4020 is OK: OK - running: The system is fully operational [09:45:55] RECOVERY - Check systemd state on cp1051 is OK: OK - running: The system is fully operational [09:45:56] RECOVERY - Check systemd state on cp3004 is OK: OK - running: The system is fully operational [09:45:56] RECOVERY - Check systemd state on cp3043 is OK: OK - running: The system is fully operational [09:45:56] RECOVERY - Check systemd state on cp3032 is OK: OK - running: The system is fully operational [09:46:04] I don't know how check_systemd_state.py got there though akosiaris :) [09:46:05] RECOVERY - Check systemd state on cp2015 is OK: OK - running: The system is fully operational [09:46:15] RECOVERY - Check systemd state on cp1061 is OK: OK - running: The system is fully operational [09:46:15] RECOVERY - Check systemd state on cp4019 is OK: OK - running: The system is fully operational [09:46:15] RECOVERY - Check systemd state on cp3042 is OK: OK - running: The system is fully operational [09:46:15] RECOVERY - Check systemd state on cp4011 is OK: OK - running: The system is fully operational [09:46:25] RECOVERY - Check systemd state on cp1060 is OK: OK - running: The system is fully operational [09:46:25] RECOVERY - Check systemd state on cp2021 is OK: OK - running: The system is fully operational [09:46:25] RECOVERY - Check systemd state on cp4012 is OK: OK - running: The system is fully operational [09:46:25] RECOVERY - Check systemd state on cp1059 is OK: OK - running: The system is fully operational [09:46:31] maybe some recursive directory resource [09:46:35] RECOVERY - Check systemd state on cp2003 is OK: OK - running: The system is fully operational [09:46:37] my money is on that right now [09:46:45] RECOVERY - Check systemd state on cp3006 is OK: OK - running: The system is fully operational [09:46:45] RECOVERY - Check systemd state 
on cp1046 is OK: OK - running: The system is fully operational [09:46:45] RECOVERY - Check systemd state on cp3033 is OK: OK - running: The system is fully operational [09:46:45] RECOVERY - Check systemd state on cp3003 is OK: OK - running: The system is fully operational [09:47:05] RECOVERY - Check systemd state on cp3005 is OK: OK - running: The system is fully operational [09:48:55] RECOVERY - puppet last run on cp2003 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [09:48:55] PROBLEM - puppet last run on mw1220 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 16 seconds ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/check_systemd_state.py] [09:49:00] akosiaris: I still get invalid byte sequence in US-ASCII running puppet on mw1220 [09:49:06] ^^^^ [09:49:22] volans: yeah, I think it's because it's trying to filebucket the old file [09:49:33] !log removed varnishrls.service from non-cache_text hosts [09:49:37] I 've removed it manually and rerun puppet on cp2003 and it was fixed [09:49:52] should you salt-remove /usr/local/lib/nagios/plugins/check_systemd_state.py everywhere? [09:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:53] at least my check got the varnishrls.service garbage, that's something ;-) [09:50:01] volans: yeah, I 'll do that [09:50:19] yeah that was the purpose, find unchecked/old stuff [09:50:20] :D [09:51:55] RECOVERY - puppet last run on mw1220 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:52:36] akosiaris: yes, thanks! :) [09:53:05] PROBLEM - puppet last run on analytics1036 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[hadoop-hdfs-datanode] [09:53:15] this is me again, already solved [09:53:18] super weir [09:54:13] *weird [09:55:05] RECOVERY - puppet last run on analytics1036 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [09:55:09] what's up with ms-be2025 btw ? [09:55:27] it's down for 2 days in icinga [09:56:15] akosiaris: strange thing, on mw1220 after cleaning the file it was reinstalled as is (with .py)... how is it possible? [09:56:35] volans: some file resource being ensure => directory and having recurse => true [09:56:42] I 'll find it out [09:57:00] how can this change the extension of the file? [09:57:07] it doesn't [09:57:20] it just copies the directory as is from the puppetmaster [09:57:23] maybe is just a puppet-merge that failed and one puppetmaster has a different version? [09:57:24] so it populates it 2 times [09:57:45] there is only 1 file there [09:57:54] check_systemd_state.py [09:58:35] # Have a directory with all our plugins. [09:58:36] file { '/usr/local/lib/nagios/plugins/': [09:58:36] ensure => directory, [09:58:36] owner => 'root', [09:58:36] group => 'root', [09:58:37] mode => '0555', [09:58:38] recurse => true, [09:58:47] I think that's it [09:58:55] source => 'puppet:///modules/nrpe/plugins', [09:59:05] nrpe/manifests/init.pp [09:59:20] so it get populated twice [10:00:07] in fact.. 
it's a race I think [10:00:13] er, no scratch that [10:00:30] the implicit dependencies guard against that [10:02:01] so I can undertand the two files now, one from source => 'puppet:///modules/nrpe/plugins', and the other one without extension from the file directive [10:02:15] for the cases with only one file [10:02:31] I think is because nrpe::systemd_scripts is not called at all [10:02:31] yes [10:02:35] otherwise python3 will be there [10:02:49] actually I hate that class [10:02:51] it's a mistake [10:02:59] I 'll refactor the entire thing [10:03:21] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322615 (https://phabricator.wikimedia.org/T147305) (owner: 10Marostegui) [10:03:31] * akosiaris decided to implement a simple check and now is burdened with an entire refactoring [10:03:38] lol [10:03:45] is not always like that with puppet? :-P [10:03:49] true [10:03:52] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322615 (https://phabricator.wikimedia.org/T147305) (owner: 10Marostegui) [10:04:13] btw, if you could take into account also to try to keep the files with extensions in puppet and without on the hosts, that would be great ;) [10:05:49] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1068 - T147305 (duration: 00m 48s) [10:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:10] T147305: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305 [10:11:07] !log Deploy ALTER table db1068 commonswiki.revision - https://phabricator.wikimedia.org/T147305 [10:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:49] !log blah blah blah [10:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:51] I will clean that [10:14:59] I have cleaned twitter too [10:29:04] !log performing blocking schema change on 
db2065 T151029 [10:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:25] T151029: duplicate key problems on s4 - https://phabricator.wikimedia.org/T151029 [10:31:25] PROBLEM - Check systemd state on labstore1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:33:25] RECOVERY - Check systemd state on labstore1004 is OK: OK - running: The system is fully operational [10:39:10] (03PS1) 10Alexandros Kosiaris: Move the check_systemd_state check to base [puppet] - 10https://gerrit.wikimedia.org/r/322627 [10:39:37] (03PS2) 10Alexandros Kosiaris: Move the check_systemd_state check to base [puppet] - 10https://gerrit.wikimedia.org/r/322627 [10:39:40] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Move the check_systemd_state check to base [puppet] - 10https://gerrit.wikimedia.org/r/322627 (owner: 10Alexandros Kosiaris) [10:42:25] PROBLEM - Check systemd state on ganeti1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:42:45] PROBLEM - Check systemd state on dbproxy1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:42:45] PROBLEM - Check systemd state on ganeti2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:43:25] PROBLEM - Check systemd state on mw2080 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:44:25] PROBLEM - Check systemd state on ganeti1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:44:25] PROBLEM - Check systemd state on mw2083 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:44:25] PROBLEM - Check systemd state on mw2082 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:44:35] PROBLEM - Check systemd state on copper is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:44:45] sigh [10:44:49] what did I do again ... [10:45:05] PROBLEM - Check systemd state on ganeti2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:45:39] oh, nothing.. those are legitimate alerts [10:46:35] PROBLEM - Check systemd state on mw2084 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:46:45] PROBLEM - Check systemd state on db1042 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:47:25] PROBLEM - Check systemd state on labstore2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:47:55] PROBLEM - Check systemd state on mw2085 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:48:35] RECOVERY - Check systemd state on copper is OK: OK - running: The system is fully operational [10:50:50] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2810289 (10Gilles) There is contradiction here. You want to serve what's best for the client "automatically", but you want the client to retain some co... [10:51:05] PROBLEM - Check systemd state on dbproxy1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:51:15] PROBLEM - Check systemd state on db1081 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:52:15] PROBLEM - Check systemd state on dbproxy1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:52:45] PROBLEM - Check systemd state on ganeti2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:52:55] PROBLEM - Check systemd state on db2011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:52:55] PROBLEM - Check systemd state on labstore2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:53:05] PROBLEM - Check systemd state on kubernetes1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:53:25] PROBLEM - Check systemd state on ganeti2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:53:45] PROBLEM - Check systemd state on heze is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:53:55] PROBLEM - Check systemd state on labstore2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:53:55] PROBLEM - Check systemd state on dbproxy1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:54:25] PROBLEM - Check systemd state on mw2081 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:54:42] 06Operations, 06DC-Ops: ms-be2025 controller failure - https://phabricator.wikimedia.org/T151201#2810291 (10akosiaris) [10:55:26] ACKNOWLEDGEMENT - Host ms-be2025 is DOWN: PING CRITICAL - Packet loss = 100% alexandros kosiaris https://phabricator.wikimedia.org/T151201 [10:57:05] RECOVERY - Check systemd state on kubernetes1001 is OK: OK - running: The system is fully operational [10:57:45] RECOVERY - Check systemd state on heze is OK: OK - running: The system is fully operational [10:58:55] RECOVERY - Check systemd state on labstore2004 is OK: OK - running: The system is fully operational [10:59:05] RECOVERY - Check systemd state on dbproxy1006 is OK: OK - running: The system is fully operational [10:59:15] RECOVERY - Check systemd state on dbproxy1009 is OK: OK - running: The system is fully operational [10:59:15] RECOVERY - Check systemd state on db1081 is OK: OK - running: The system is fully operational [10:59:25] RECOVERY - Check systemd state on labstore2003 is OK: OK - running: The system is fully operational [10:59:45] RECOVERY - Check systemd state on dbproxy1010 is OK: OK - running: The system is fully operational [10:59:45] RECOVERY - Check systemd state on db1042 is OK: OK - running: The system is fully operational [10:59:55] RECOVERY - Check systemd state on db2011 is OK: OK - running: The system is fully operational [10:59:55] RECOVERY - Check systemd state on dbproxy1008 is OK: OK - running: The system is fully operational [11:00:16] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1068" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322629 [11:00:45] PROBLEM - Check systemd state on hafnium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:02:45] PROBLEM - Check systemd state on ganeti2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:03:15] PROBLEM - Check systemd state on ganeti1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:05:05] PROBLEM - Check systemd state on pybal-test2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:05:15] PROBLEM - Check systemd state on kubernetes1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:05:25] PROBLEM - Check systemd state on ganeti1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:05:55] PROBLEM - Check systemd state on cp3014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:06:59] 06Operations, 10MediaWiki-extensions-VipsScaler, 10Wikimedia-Site-requests, 13Patch-For-Review: VIPS scaled thumbnails don't have a comment with a link to the file description page - https://phabricator.wikimedia.org/T71336#2810319 (10TheDJ) @Dereckson right. I have no opinion on this. My ticket was only a... [11:07:29] (03PS1) 10Alexandros Kosiaris: ganeti: Disable the drbd service [puppet] - 10https://gerrit.wikimedia.org/r/322630 [11:08:42] (03PS2) 10Alexandros Kosiaris: ganeti: Disable the drbd service [puppet] - 10https://gerrit.wikimedia.org/r/322630 [11:08:45] PROBLEM - Check systemd state on kubernetes1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:08:46] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] ganeti: Disable the drbd service [puppet] - 10https://gerrit.wikimedia.org/r/322630 (owner: 10Alexandros Kosiaris) [11:08:55] PROBLEM - Check systemd state on cp3020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:09:55] PROBLEM - Check systemd state on ganeti2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:12:05] RECOVERY - Check systemd state on ganeti2004 is OK: OK - running: The system is fully operational [11:12:15] RECOVERY - Check systemd state on ganeti1001 is OK: OK - running: The system is fully operational [11:12:25] RECOVERY - Check systemd state on ganeti1002 is OK: OK - running: The system is fully operational [11:12:25] RECOVERY - Check systemd state on ganeti1004 is OK: OK - running: The system is fully operational [11:12:25] RECOVERY - Check systemd state on ganeti2003 is OK: OK - running: The system is fully operational [11:12:25] RECOVERY - Check systemd state on ganeti1003 is OK: OK - running: The system is fully operational [11:12:45] RECOVERY - Check systemd state on ganeti2002 is OK: OK - running: The system is fully operational [11:12:45] RECOVERY - Check systemd state on ganeti2001 is OK: OK - running: The system is fully operational [11:12:45] RECOVERY - Check systemd state on ganeti2006 is OK: OK - running: The system is fully operational [11:12:55] RECOVERY - Check systemd state on ganeti2005 is OK: OK - running: The system is fully operational [11:13:55] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:15:44] (03Abandoned) 10Dereckson: Install exiv2 to mediawiki::packages::multimedia [puppet] - 10https://gerrit.wikimedia.org/r/309505 (https://phabricator.wikimedia.org/T71336) (owner: 10Dereckson) [11:18:22] 06Operations, 10MediaWiki-extensions-VipsScaler, 10Wikimedia-Site-requests, 13Patch-For-Review: VIPS scaled thumbnails don't have a comment with a link to the file description page - https://phabricator.wikimedia.org/T71336#2810336 (10Dereckson) a:05Dereckson>03None To summarize, it was offered to solv... 
[11:18:25] RECOVERY - Check systemd state on mw2083 is OK: OK - running: The system is fully operational [11:18:25] RECOVERY - Check systemd state on mw2081 is OK: OK - running: The system is fully operational [11:18:25] RECOVERY - Check systemd state on mw2082 is OK: OK - running: The system is fully operational [11:18:25] RECOVERY - Check systemd state on mw2080 is OK: OK - running: The system is fully operational [11:18:35] RECOVERY - Check systemd state on mw2084 is OK: OK - running: The system is fully operational [11:18:55] RECOVERY - Check systemd state on mw2085 is OK: OK - running: The system is fully operational [11:25:01] (03PS3) 10Aklapper: Phab: Remove custom UI string translations not in use anymore [puppet] - 10https://gerrit.wikimedia.org/r/322144 [11:25:55] RECOVERY - Check systemd state on cp3014 is OK: OK - running: The system is fully operational [11:26:22] PROBLEM - Check systemd state on mw2081 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:26:26] 06Operations: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196#2810344 (10Ankry) May it be related to recent high bot activity? After loweing the activity ~twice jobs No. seems to be decreasing slowly (top value was ~1050000): https://commons.wikimedia.org/w/api.php?... 
[11:27:02] RECOVERY - Check systemd state on cp3020 is OK: OK - running: The system is fully operational [11:27:42] RECOVERY - Check systemd state on hafnium is OK: OK - running: The system is fully operational [11:29:00] 06Operations, 06Performance-Team, 10Thumbor: Thumbor should reject some malformed thumbnail URLs - https://phabricator.wikimedia.org/T150749#2810345 (10Gilles) [11:32:41] 06Operations: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196#2810146 (10akosiaris) Looking at some of the jobrunners I get this ``` $ sudo journalctl -ru jobchron.service Nov 21 11:23:51 mw2081 jobchron[30826]: [Mon Nov 21 11:23:51 2016] [hphp] [30826:7f3d9d5e7100:... [11:32:48] !log starting cluster restart on elasticsearch eqiad for JVM upgrade [11:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:22] RECOVERY - Check systemd state on mw2081 is OK: OK - running: The system is fully operational [11:40:52] RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [11:45:32] PROBLEM - Check systemd state on mw2080 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:46:12] RECOVERY - Check systemd state on pybal-test2003 is OK: OK - running: The system is fully operational [11:46:22] PROBLEM - Check systemd state on mw2083 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:46:22] PROBLEM - Check systemd state on mw2082 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:48:32] PROBLEM - Check systemd state on mw2084 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:48:52] PROBLEM - Check systemd state on mw2085 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:54:09] mmmm jobcron/runner on --^ seems down [11:54:22] RECOVERY - Check systemd state on mw2083 is OK: OK - running: The system is fully operational [11:54:31] just restarted --^ [11:55:22] RECOVERY - Check systemd state on mw2082 is OK: OK - running: The system is fully operational [11:55:52] RECOVERY - Check systemd state on mw2085 is OK: OK - running: The system is fully operational [11:56:25] PROBLEM - Check systemd state on mw2081 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:56:35] RECOVERY - Check systemd state on mw2084 is OK: OK - running: The system is fully operational [11:56:43] !log restarted jobchron/runner on mw208[0-5] since systemd was reporting degradation (broken pipes in the journald logs) [11:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:25] RECOVERY - Check systemd state on mw2081 is OK: OK - running: The system is fully operational [11:58:00] but I have no idea what should be the ideal state in codfw [11:58:18] moreover, a puppet run didn't restart the job* daemons [11:58:25] RECOVERY - Check systemd state on mw2080 is OK: OK - running: The system is fully operational [11:59:10] any way of requesting a dashboard on https://grafana.wikimedia.org (if it looks at edit filter stats)? [11:59:49] (03PS1) 10Alexandros Kosiaris: ganeti: Disable KSM [puppet] - 10https://gerrit.wikimedia.org/r/322633 [12:02:48] myrcx: do you have access to grafana-admin.wikimedia.org ? If yes you should be able to create one your own. Otherwise, file a task in phabricator. Someone will pick it and implement it if deemed useful indeed [12:03:01] one on your own* [12:04:03] akosiaris: I very much doubt I do :) I'll file a phab task, is there a tag/project for grafana? 
[12:04:31] graphite would be the closest probably [12:04:52] sounds good, thank you [12:04:55] but if it is something that is related to a team, adding the team would be even better [12:05:08] not sure what you want graphed though [12:05:44] elukey: that broken pipe thing has me pondering https://phabricator.wikimedia.org/T151196 [12:07:14] akosiaris: ah snap sorry! [12:07:14] akosiaris: thought it would be somewhat informative to have number of abusefilter hits either graphed or added to https://grafana.wikimedia.org/dashboard/db/abuse - be interesting as an EFM to see how many get hit a minute etc [12:07:16] (03PS2) 10Alexandros Kosiaris: ganeti: Disable KSM [puppet] - 10https://gerrit.wikimedia.org/r/322633 [12:07:24] 06Operations, 10ops-codfw, 10DBA: db2041: Disk RAID predictive failure - https://phabricator.wikimedia.org/T151203#2810386 (10Marostegui) [12:07:24] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] ganeti: Disable KSM [puppet] - 10https://gerrit.wikimedia.org/r/322633 (owner: 10Alexandros Kosiaris) [12:07:54] elukey: I don't think there is something to be sorry about. I am not sure why the jobrunners in codfw are not running. [12:14:36] akosiaris: I thought that you put them down on purpose for a test, now I've read the task again and I got it :) [12:15:26] PROBLEM - Check systemd state on mw2080 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:16:25] PROBLEM - Check systemd state on mw2083 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:16:25] PROBLEM - Check systemd state on mw2082 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[12:17:26] !log Stopping MySQL db1095 - maintenance - T150960 [12:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:49] T150960: Initial data tests for db1095 - https://phabricator.wikimedia.org/T150960 [12:18:35] PROBLEM - Check systemd state on mw2084 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:18:37] jouncebot: next [12:18:37] In 1 hour(s) and 41 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161121T1400) [12:19:55] PROBLEM - Check systemd state on mw2085 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:26:25] PROBLEM - Check systemd state on mw2081 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:42:48] (03PS1) 10Hashar: udp2log: prevent Ganglia install when it is not used [puppet] - 10https://gerrit.wikimedia.org/r/322639 (https://phabricator.wikimedia.org/T151169) [12:45:55] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:46:55] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [12:47:04] (03CR) 10Hashar: "Unlocks puppet for deployment-fluorine02.deployment-prep.eqiad.wmflabs. I can't tell what is going to be the result on production though." [puppet] - 10https://gerrit.wikimedia.org/r/322639 (https://phabricator.wikimedia.org/T151169) (owner: 10Hashar) [12:51:15] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:51:51] !log Restarted udp2log-mw service on deployment-fluorine02 . 
Was not available (ping T146723 T151169) [12:52:05] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [12:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:14] T146723: deployment-fluorine02 does not have logs - https://phabricator.wikimedia.org/T146723 [12:52:14] T151169: deployment-fluorine02 puppet broken - https://phabricator.wikimedia.org/T151169 [12:52:55] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:53:09] 06Operations: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#2810467 (10hashar) [12:53:55] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [12:54:00] hashar: is everything going to jessie? It seems all are updating [13:08:37] 06Operations, 06Performance-Team, 10Thumbor: Implement rate limiter - https://phabricator.wikimedia.org/T151067#2810509 (10Gilles) The current rate-limiter in Mediawiki works per IP and per user. It honors XFF headers provided by Swift. It only applies to when an image needs to be rendered, not when it alrea... [13:10:39] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1068" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322629 (owner: 10Marostegui) [13:11:29] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1068" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322629 (owner: 10Marostegui) [13:12:15] PROBLEM - Check systemd state on cp3042 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[13:12:45] RECOVERY - Check systemd state on kubernetes1003 is OK: OK - running: The system is fully operational [13:13:42] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1068 - T147305 (duration: 00m 49s) [13:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:14] T147305: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305 [13:15:15] RECOVERY - Check systemd state on cp3042 is OK: OK - running: The system is fully operational [13:17:15] (03CR) 10Hashar: [C: 04-1] contint: basic role class for logstash [puppet] - 10https://gerrit.wikimedia.org/r/322488 (owner: 10Hashar) [13:17:45] PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[generate_varnishkafka_statsv_gmond_pyconf] [13:19:20] (03Abandoned) 10Hashar: logstash: make statsd_host optional [puppet] - 10https://gerrit.wikimedia.org/r/322486 (owner: 10Hashar) [13:22:44] 06Operations, 06Performance-Team, 10Thumbor: Investigate differences in status codes between thumbor and image scalers - https://phabricator.wikimedia.org/T150641#2810539 (10Gilles) [13:22:46] 06Operations, 06Performance-Team, 10Thumbor: Investigate SVG default language behavior on non-English wikis for Thumbor - https://phabricator.wikimedia.org/T150743#2810537 (10Gilles) 05Open>03Resolved This works fine: https://upload.wikimedia.org/wikipedia/fr/thumb/4/45/Speech_bubbles_test2.svg/langfr-12... 
[13:24:54] 06Operations, 06Performance-Team, 10Thumbor: Thumbor should handle page redirects like Mediawiki does - https://phabricator.wikimedia.org/T148410#2810546 (10Gilles) [13:25:28] 06Operations, 06Performance-Team, 10Thumbor: Thumbor should handle page redirects like Mediawiki does - https://phabricator.wikimedia.org/T148410#2722010 (10Gilles) [13:25:31] 06Operations, 06Performance-Team, 10Thumbor: Thumbor/rewrite.py should check for redirects regardless of status code of the response - https://phabricator.wikimedia.org/T150751#2810550 (10Gilles) [13:25:45] PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:26:52] (03Abandoned) 10Gilles: Add mtail program to track thumbor OOM kills [puppet] - 10https://gerrit.wikimedia.org/r/315272 (https://phabricator.wikimedia.org/T148962) (owner: 10Gilles) [13:29:40] (03PS3) 10Hashar: Remove role::beta::uploadservice [puppet] - 10https://gerrit.wikimedia.org/r/322403 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [13:29:42] (03PS2) 10Hashar: Remove beta::config [puppet] - 10https://gerrit.wikimedia.org/r/322406 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [13:29:44] (03PS2) 10Hashar: Remove beta::deployaccess [puppet] - 10https://gerrit.wikimedia.org/r/322407 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [13:29:46] (03PS2) 10Hashar: Remove role::beta::bastion [puppet] - 10https://gerrit.wikimedia.org/r/322404 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [13:29:48] (03PS2) 10Hashar: Remove role::beta::trebuchet_testing [puppet] - 10https://gerrit.wikimedia.org/r/322405 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [13:31:02] (03CR) 10Hashar: [C: 031] "That one was used for deployment-upload which was emulating the old media servers. 
Entirely replaced by a Swift based infrastructure simil" [puppet] - 10https://gerrit.wikimedia.org/r/322403 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [13:32:26] (03CR) 10Hashar: [C: 031] "The role is no more applied anywhere and I confirmed the cron entry is not present on deployment-tin." [puppet] - 10https://gerrit.wikimedia.org/r/322404 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [13:33:22] (03CR) 10Hashar: [C: 031] "The deployment servers such as deployment-tin have /srv/deployment/test/testrepo/ which I guess is provisioned via hieradata/common/role/d" [puppet] - 10https://gerrit.wikimedia.org/r/322405 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [13:35:15] (03CR) 10Hashar: "Apparently no more used. Would like Tyler to confirm, I think he knows more about it and how scap might rely on those parameters." [puppet] - 10https://gerrit.wikimedia.org/r/322406 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [13:36:44] (03CR) 10Hashar: [C: 04-1] "From T121721 that has been removed with https://gerrit.wikimedia.org/r/#/c/313903/ and apparently reverted later. Seems it is still being " [puppet] - 10https://gerrit.wikimedia.org/r/322407 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [13:36:52] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2810564 (10Gilles) [13:36:55] 06Operations, 06Performance-Team, 10Thumbor: Record OOM kills as a metric with mtail - https://phabricator.wikimedia.org/T148962#2810563 (10Gilles) [13:42:18] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 404s on a number of images Mediawiki is successful with - https://phabricator.wikimedia.org/T150760#2810580 (10Gilles) It might be two issues, some requests seem to have special characters in them, and others have /thumb/temp. 
[13:43:37] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 404s on a number of images Mediawiki is successful with - https://phabricator.wikimedia.org/T150760#2810581 (10Gilles) @fgiunchedi Anything special about /thumb/temp? [13:45:45] RECOVERY - puppet last run on cp3042 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [13:47:17] jouncebot: next [13:47:18] In 0 hour(s) and 12 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161121T1400) [13:47:32] zeljkof: swat is straightforward today :} [13:48:02] hashar: easy enough even I could do it? ;) [13:48:41] I though you will be at the co-working space all day, should I do swat? [13:49:49] one already merged ( https://gerrit.wikimedia.org/r/#/c/322606/ ) [13:49:59] I got hosted by a trendy startup :D [13:50:47] hashar: cool [13:50:56] https://pbs.twimg.com/media/CtmmgeHXEAA4oXo.jpg :D [13:50:59] should I deploy the merged one the first them? 322606 [13:51:10] rebasing the patches [13:51:24] hashar: oh, the place where people dance on the tables? [13:51:34] kind of :} [13:51:40] (03PS2) 10Hashar: Set $wgForeignUploadTargets to [ 'local' ] for zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322251 (https://phabricator.wikimedia.org/T139257) (owner: 10Zhuyifei1999) [13:51:42] (03PS2) 10Hashar: Configure Babel for fr.wikibooks and fr.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322493 (https://phabricator.wikimedia.org/T146213) (owner: 10Dereckson) [13:52:25] their product is open source and written in PHP. So we have a lot in common [13:52:29] patches rebased [13:53:45] RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [13:56:25] hashar: are you doing the swat? or me? 
[13:56:27] * zeljkof is confused [13:56:55] !log performing blocking schema change on db2058 T151029 [13:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:17] T151029: duplicate key problems on s4 - https://phabricator.wikimedia.org/T151029 [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161121T1400). [14:00:04] MatmaRex and Dereckson: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [14:02:01] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 404s on a number of images Mediawiki is successful with - https://phabricator.wikimedia.org/T150760#2810598 (10Gilles) Ah, I guess temp goes to a different container? [14:02:48] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 404s on a number of images Mediawiki is successful with - https://phabricator.wikimedia.org/T150760#2810599 (10Gilles) And it's authenticated? :-/ [14:03:31] hashar: ok, for the record, are you doing the swat, or me? [14:03:46] MatmaRex, Dereckson: around for EU SWAT? [14:03:52] hi [14:05:52] I will wait a few more minutes for the confirmation from hashar that he is doing the swat, if there are no reply, I will start with the swat [14:06:09] I would like to avoid both of us deploying at the same time :) [14:06:34] zeljkof: do it :) [14:06:42] hashar: ok [14:06:50] I can SWAT today! :D [14:07:09] [= [14:09:11] Dereckson: around for swat? [14:09:46] MatmaRex: I am starting with 322251, can you test it at mw1099, once it is there? 
[14:10:12] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322251 (https://phabricator.wikimedia.org/T139257) (owner: 10Zhuyifei1999) [14:10:34] zeljkof: yeah [14:11:14] MatmaRex: great, will ping you in a few minutes, as soon as it is there [14:12:35] RECOVERY - Check systemd state on kubernetes1004 is OK: OK - running: The system is fully operational [14:13:04] Hi zeljkof , yes I'm. Please note the CentralAuth rename change has already been deployed [14:13:34] Dereckson: 322606 is deployed? or just merged? [14:13:46] deployed [14:14:00] Dereckson: ok, can you remove it from the calendar then, please [14:14:13] (03Merged) 10jenkins-bot: Set $wgForeignUploadTargets to [ 'local' ] for zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322251 (https://phabricator.wikimedia.org/T139257) (owner: 10Zhuyifei1999) [14:14:33] Nope, I'm not really the one who decided not to wait to deploy it, after we had a discussion in the night to wait for SWAT. [14:15:51] MatmaRex: 322251 is at mw1099, please test [14:16:19] Dereckson: ok, so somebody deployed 322606 outside of agreed swat? [14:16:36] thanks for letting me know, will skip it [14:16:53] zeljkof: doing, give me a minute :) [14:17:06] MatmaRex: sure, no rush [14:19:26] zeljkof: well, it seems to work. it tells me i can't upload files on zh.wp, which seems to be true, instead of showing the upload dialog for commons ;) [14:19:54] MatmaRex: ok, ready to upload to the intertubeverse? [14:20:02] yeah. ship it [14:20:24] sending this bold patch where no patch has gone before... 
[14:21:42] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:322251|Set $wgForeignUploadTargets to [ local ] for zhwiki (T139257)]] (duration: 00m 50s) [14:22:02] MatmaRex: ok, deployed, please test [14:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:05] T139257: Disable "cross-wiki upload" from zhwiki to commons - https://phabricator.wikimedia.org/T139257 [14:22:32] Dereckson: skipping 322606 since it is already deployed, as agreed earlier [14:22:37] k [14:22:40] zeljkof: all fine. thanks [14:23:06] MatmaRex: thank you with deploying with us, we hope to see you again soon ;) [14:23:31] Dereckson: can you test 322493 at mw1099, once it is there? [14:23:43] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322493 (https://phabricator.wikimedia.org/T146213) (owner: 10Dereckson) [14:23:51] \O/ [14:24:40] zeljkof: yes [14:24:47] (03Merged) 10jenkins-bot: Configure Babel for fr.wikibooks and fr.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322493 (https://phabricator.wikimedia.org/T146213) (owner: 10Dereckson) [14:24:56] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 404s on a number of images Mediawiki is successful with - https://phabricator.wikimedia.org/T150760#2810634 (10Gilles) Ah, reading thumb.php it seems like it's reading the original from the public zone, not the temp zone. Seems like they are different thin... [14:25:25] Dereckson: great, will ping you in a minute or two [14:26:00] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2810635 (10Ottomata) Hm, maybe I'm getting confused. Isn't the point of hiera to be able to look up the same variable from different scopes and get dif... [14:27:15] Dereckson: 322493 is at mw1099, please test [14:27:43] * Dereckson nods. 
[14:34:08] (03PS3) 10Ema: cache: get rid of varnish 3 compatibility code [puppet] - 10https://gerrit.wikimedia.org/r/322252 (https://phabricator.wikimedia.org/T150660) [14:34:14] (03CR) 10Ema: [C: 032 V: 032] cache: get rid of varnish 3 compatibility code [puppet] - 10https://gerrit.wikimedia.org/r/322252 (https://phabricator.wikimedia.org/T150660) (owner: 10Ema) [14:34:15] Dereckson: just checking, still testing? [14:34:49] Yes, I try to create a case triggering a new category on both wikis. [14:35:23] Dereckson: no rush, just checking if you need more time [14:35:33] * zeljkof is checking the logs [14:37:38] (03CR) 10Ottomata: "Hm, it shouldn't be a caveat. "The production Kafka cluster" is ambiguous. Currently, we have 2 logical clusters: 'analytics' and 'main'" [puppet] - 10https://gerrit.wikimedia.org/r/320690 (https://phabricator.wikimedia.org/T143925) (owner: 10Ottomata) [14:39:30] * zeljkof is back in a couple of minutes [14:39:49] works on fr.wikiversity [14:40:26] !log deleted oathauth row for wiki13 T151209 [14:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:48] T151209: Resetting 2FA for my wiki-account - https://phabricator.wikimedia.org/T151209 [14:41:02] zeljkof: works fine [14:44:49] Dereckson: deploying [14:46:00] Thanks. [14:46:41] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:322493|Configure Babel for fr.wikibooks and fr.wikiversity (T146213)]] (duration: 00m 49s) [14:46:55] Dereckson: deployed, please check [14:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:03] T146213: Please set {{#Babel}} to autocategorize on the French Wikibooks and Wikiversity - https://phabricator.wikimedia.org/T146213 [14:49:20] Works fine, thanks for the deployment. [14:50:24] zeljkof: all done? [14:50:36] all done! [14:50:43] !log EU SWAT finished! 
[14:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:34] nice [14:54:42] (03PS1) 10Jcrespo: Depool db1056 for schema change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322655 (https://phabricator.wikimedia.org/T151029) [14:59:41] Krinkle: ori, fyi, i'm going to do a statsv deploy shortly [14:59:51] with new kafka client and some argparse stuff [15:08:20] (03PS1) 10Giuseppe Lavagetto: Modify Dockerfile, build to meet WMF requirements [calico-k8s-policy-controller] - 10https://gerrit.wikimedia.org/r/322660 [15:08:28] (03PS1) 10Ottomata: Puppetization for statsv args added in https://gerrit.wikimedia.org/r/#/c/321911/ [puppet] - 10https://gerrit.wikimedia.org/r/322661 (https://phabricator.wikimedia.org/T150765) [15:13:13] (03CR) 10jenkins-bot: [V: 04-1] Puppetization for statsv args added in https://gerrit.wikimedia.org/r/#/c/321911/ [puppet] - 10https://gerrit.wikimedia.org/r/322661 (https://phabricator.wikimedia.org/T150765) (owner: 10Ottomata) [15:15:48] (03CR) 10Ottomata: "Checked in: https://puppet-compiler.wmflabs.org/4615/hafnium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/322661 (https://phabricator.wikimedia.org/T150765) (owner: 10Ottomata) [15:15:59] (03PS2) 10Ottomata: Puppetization for statsv args added in https://gerrit.wikimedia.org/r/#/c/321911/ [puppet] - 10https://gerrit.wikimedia.org/r/322661 (https://phabricator.wikimedia.org/T150765) [15:16:42] (03CR) 10Ottomata: [C: 032 V: 032] Puppetization for statsv args added in https://gerrit.wikimedia.org/r/#/c/321911/ [puppet] - 10https://gerrit.wikimedia.org/r/322661 (https://phabricator.wikimedia.org/T150765) (owner: 10Ottomata) [15:16:45] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [15:17:25] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3670 bytes in 0.034 second response time [15:17:59] 06Operations, 06Performance-Team, 
10Thumbor: Thumbor 404s on a number of images Mediawiki is successful with - https://phabricator.wikimedia.org/T150760#2810741 (10Gilles) It seems to be the temp container after all. This code is confusing as hell... So it would seem that not only the hash correction is need... [15:18:32] (03PS1) 10Ottomata: Revert "Puppetization for statsv args added in https://gerrit.wikimedia.org/r/#/c/321911/" [puppet] - 10https://gerrit.wikimedia.org/r/322662 [15:18:45] (03CR) 10Ottomata: "Didn't want ot merge mariadb module chaneg!" [puppet] - 10https://gerrit.wikimedia.org/r/322662 (owner: 10Ottomata) [15:18:48] (03CR) 10Ottomata: [C: 032 V: 032] "Didn't want ot merge mariadb module chaneg!" [puppet] - 10https://gerrit.wikimedia.org/r/322662 (owner: 10Ottomata) [15:19:22] jynus: yt? [15:21:15] (03PS1) 10BBlack: tlsproxy: $certs_active as a subset of $certs [puppet] - 10https://gerrit.wikimedia.org/r/322664 [15:21:17] (03PS1) 10BBlack: rename new gs certs [puppet] - 10https://gerrit.wikimedia.org/r/322665 [15:21:19] (03PS1) 10BBlack: deploy new globalsign certs as inactive [puppet] - 10https://gerrit.wikimedia.org/r/322666 [15:22:06] (03PS1) 10Ottomata: Puppetization for statsv args added in https://gerrit.wikimedia.org/r/#/c/321911/ [puppet] - 10https://gerrit.wikimedia.org/r/322668 (https://phabricator.wikimedia.org/T150765) [15:22:29] (03CR) 10Ottomata: [C: 032 V: 032] Puppetization for statsv args added in https://gerrit.wikimedia.org/r/#/c/321911/ [puppet] - 10https://gerrit.wikimedia.org/r/322668 (https://phabricator.wikimedia.org/T150765) (owner: 10Ottomata) [15:26:03] (03Draft2) 10MarcoAurelio: Re-enable 'centralauth-rename' rights as maintenance is done [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322667 (https://phabricator.wikimedia.org/T148242) [15:26:22] ^ for when it is done should be [15:28:06] (03PS3) 10MarcoAurelio: Re-enable 'centralauth-rename' rights for when maintenance is done [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322667 
(https://phabricator.wikimedia.org/T148242) [15:29:24] (03CR) 10MarcoAurelio: Re-enable 'centralauth-rename' rights for when maintenance is done (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322667 (https://phabricator.wikimedia.org/T148242) (owner: 10MarcoAurelio) [15:29:27] (03PS1) 10BBlack: add new fake entries for new unified keys [labs/private] - 10https://gerrit.wikimedia.org/r/322675 [15:29:55] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2810764 (10akosiaris) The point of hiera is to separate the code from the (configuration) code from the (configuration) data, minimizing code duplicatio... [15:35:32] (03PS4) 10MarcoAurelio: Re-enable 'centralauth-rename' rights for when maintenance is done [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322667 (https://phabricator.wikimedia.org/T148242) [15:38:44] (03PS1) 10Ottomata: Increase statsv watchdog systemd timeout to 2 minutes [puppet] - 10https://gerrit.wikimedia.org/r/322678 [15:39:12] (03CR) 10Ottomata: [C: 032 V: 032] Increase statsv watchdog systemd timeout to 2 minutes [puppet] - 10https://gerrit.wikimedia.org/r/322678 (owner: 10Ottomata) [15:40:11] (03PS5) 10MarcoAurelio: Re-enable 'centralauth-rename' rights for when maintenance is done [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322667 (https://phabricator.wikimedia.org/T148242) [15:41:06] (03CR) 10BBlack: [C: 032 V: 032] add new fake entries for new unified keys [labs/private] - 10https://gerrit.wikimedia.org/r/322675 (owner: 10BBlack) [15:41:35] (03PS6) 10MarcoAurelio: Re-enable 'centralauth-rename' rights for when maintenance is done [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322667 (https://phabricator.wikimedia.org/T148242) [15:41:38] (03PS2) 10BBlack: tlsproxy: $certs_active as a subset of $certs [puppet] - 10https://gerrit.wikimedia.org/r/322664 [15:41:40] (03PS2) 10BBlack: rename new gs 
certs [puppet] - 10https://gerrit.wikimedia.org/r/322665 [15:41:42] (03PS2) 10BBlack: deploy new globalsign certs as inactive [puppet] - 10https://gerrit.wikimedia.org/r/322666 [15:42:58] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 404s on a number of images Mediawiki is successful with - https://phabricator.wikimedia.org/T150760#2810792 (10Gilles) @fgiunchedi attempting to get an object from the temp container with the mw:thumbor credentials gives me a 403: ``` swiftclient.excepti... [15:43:09] 06Operations, 10ops-codfw, 10DBA: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2810793 (10Papaul) 05Open>03Resolved The old disks are wiped. Good to close this task. [15:43:27] 06Operations, 10ops-codfw, 10DBA: db2041: Disk RAID predictive failure - https://phabricator.wikimedia.org/T151203#2810797 (10Papaul) p:05Triage>03Normal [15:43:34] (03CR) 10BBlack: [C: 032] tlsproxy: $certs_active as a subset of $certs [puppet] - 10https://gerrit.wikimedia.org/r/322664 (owner: 10BBlack) [15:43:37] !log Shutting down db2034 for HW maintenance - T149553 [15:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:59] T149553: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553 [15:44:48] (03CR) 10BBlack: [C: 032] rename new gs certs [puppet] - 10https://gerrit.wikimedia.org/r/322665 (owner: 10BBlack) [15:46:18] (03CR) 10Alexandros Kosiaris: "Hm, so this is becoming confusing. So we got a cluster we name "main" one, but it's clearly small, so it kind of contradicts the name." 
[puppet] - 10https://gerrit.wikimedia.org/r/320690 (https://phabricator.wikimedia.org/T143925) (owner: 10Ottomata) [15:47:23] (03CR) 10Marostegui: [C: 031] Depool db1056 for schema change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322655 (https://phabricator.wikimedia.org/T151029) (owner: 10Jcrespo) [15:49:01] (03CR) 10Jcrespo: [C: 032] Depool db1056 for schema change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322655 (https://phabricator.wikimedia.org/T151029) (owner: 10Jcrespo) [15:50:46] (03PS2) 10Hashar: contint: ElasticSearch role for build logs [puppet] - 10https://gerrit.wikimedia.org/r/322488 (https://phabricator.wikimedia.org/T78705) [15:51:41] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2810806 (10GWicke) I don't see the search for the right granularity as a contradiction, but more as a search for the best set of trade-offs along a con... [15:51:47] (03PS3) 10BBlack: deploy new globalsign certs as inactive [puppet] - 10https://gerrit.wikimedia.org/r/322666 [15:51:49] (03PS1) 10BBlack: Revert "GlobalSign G2 intermediate, signed by R3" [puppet] - 10https://gerrit.wikimedia.org/r/322683 (https://phabricator.wikimedia.org/T148045) [15:52:13] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1056 T151029 (duration: 00m 53s) [15:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:34] T151029: duplicate key problems on s4 - https://phabricator.wikimedia.org/T151029 [15:52:47] 06Operations, 10ops-codfw, 10DBA: Several es20XX servers keep crashing (es2017, es2019, es2015, es2014) since 23 March - https://phabricator.wikimedia.org/T130702#2810811 (10Papaul) [15:52:49] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: es2015 crashed with no os logs (kernel logs or other software ones) - it shuddenly went down - https://phabricator.wikimedia.org/T147769#2810809 (10Papaul) 
05Open>03Resolved It has been a month now this system hasn't reported the same error after sw... [15:55:24] (03PS4) 10BBlack: deploy new globalsign certs as inactive [puppet] - 10https://gerrit.wikimedia.org/r/322666 [15:55:26] (03PS2) 10BBlack: Revert "GlobalSign G2 intermediate, signed by R3" [puppet] - 10https://gerrit.wikimedia.org/r/322683 (https://phabricator.wikimedia.org/T148045) [15:56:09] (03CR) 10BBlack: [C: 032 V: 032] Revert "GlobalSign G2 intermediate, signed by R3" [puppet] - 10https://gerrit.wikimedia.org/r/322683 (https://phabricator.wikimedia.org/T148045) (owner: 10BBlack) [15:58:12] !log Shutting down MySQL es2019 for HW maintenance - T149526 [15:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:33] T149526: es2019 crashed again - https://phabricator.wikimedia.org/T149526 [15:59:31] !log Powering off es2019 for HW maintenance - T149526 [15:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:41] (03PS1) 10Giuseppe Lavagetto: profile::calico::builder: add calico-cni and k8s-policy-controller [puppet] - 10https://gerrit.wikimedia.org/r/322686 [16:03:43] (03CR) 10BBlack: [C: 032] deploy new globalsign certs as inactive [puppet] - 10https://gerrit.wikimedia.org/r/322666 (owner: 10BBlack) [16:07:20] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "we can improve this later" [calico-cni] - 10https://gerrit.wikimedia.org/r/322273 (owner: 10Giuseppe Lavagetto) [16:07:39] !log performing blocking schema change on db1056 (depooled) T151029 [16:07:57] (03PS1) 10BBlack: tlsproxy: use certs_nginx for ssl_stapling_file [puppet] - 10https://gerrit.wikimedia.org/r/322687 [16:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:00] T151029: duplicate key problems on s4 - https://phabricator.wikimedia.org/T151029 [16:08:10] (03CR) 10BBlack: [C: 032 V: 032] tlsproxy: use certs_nginx for ssl_stapling_file [puppet] - 10https://gerrit.wikimedia.org/r/322687 
(owner: 10BBlack) [16:09:14] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Modify Dockerfile, build to meet WMF requirements [calico-k8s-policy-controller] - 10https://gerrit.wikimedia.org/r/322660 (owner: 10Giuseppe Lavagetto) [16:10:16] (03PS1) 10Jcrespo: Revert "Depool db1056 for schema change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322688 [16:10:35] (03CR) 10Jcrespo: [C: 04-2] "Wait for alter table to finish." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322688 (owner: 10Jcrespo) [16:15:35] (03PS1) 10Ema: cache: remove persistent storage support [puppet] - 10https://gerrit.wikimedia.org/r/322691 (https://phabricator.wikimedia.org/T150660) [16:19:44] (03PS3) 10Hashar: contint: ElasticSearch role for build logs [puppet] - 10https://gerrit.wikimedia.org/r/322488 (https://phabricator.wikimedia.org/T78705) [16:21:12] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 06Analytics-Kanban, 06DC-Ops: Kafka1022 needs a new disk - https://phabricator.wikimedia.org/T151028#2810879 (10Nuria) [16:22:48] 06Operations: update-ca-certificates, run via puppets sslcert module, doesn't update symlinks to replaced certificates - https://phabricator.wikimedia.org/T150058#2772912 (10hashar) Maybe similar or even a dupe of {T145609} [16:23:50] (03PS1) 10BBlack: caches: switch to new active unified TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/322694 [16:23:52] (03PS1) 10BBlack: cache_misc - switch to unified cert only [puppet] - 10https://gerrit.wikimedia.org/r/322695 [16:23:54] (03PS1) 10BBlack: remove unused r::c::ssl::misc [puppet] - 10https://gerrit.wikimedia.org/r/322696 [16:23:56] (03PS1) 10BBlack: add planet and wmfusercontent to unified SAN checks [puppet] - 10https://gerrit.wikimedia.org/r/322697 [16:25:14] (03CR) 10BBlack: [C: 04-2] "Don't deploy yet. Start time is 2016-11-21 00:00 UTC. 
We probably want that start time to be *at least* 24 hours in the past before depl" [puppet] - 10https://gerrit.wikimedia.org/r/322694 (owner: 10BBlack) [16:26:10] (03CR) 10Thcipriani: [C: 031] "This functionality should be replaced by role::beta::mediawiki (for mediawiki deploys) + scap::target (for scap3 targets)." [puppet] - 10https://gerrit.wikimedia.org/r/322407 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [16:29:08] (03CR) 10Alex Monk: "If something were using this, there'd be files on the filesystem to show it." [puppet] - 10https://gerrit.wikimedia.org/r/322407 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [16:29:30] cmjohnson1: ping for awareness, just in case: https://phabricator.wikimedia.org/T151028 [16:29:54] (03PS2) 10Ema: cache: remove persistent storage support [puppet] - 10https://gerrit.wikimedia.org/r/322691 (https://phabricator.wikimedia.org/T150660) [16:30:25] PROBLEM - puppet last run on analytics1026 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:32:42] (03CR) 10BBlack: [C: 031] cache: remove persistent storage support [puppet] - 10https://gerrit.wikimedia.org/r/322691 (https://phabricator.wikimedia.org/T150660) (owner: 10Ema) [16:32:44] (03PS1) 10Eevans: enable instance restbase2010-c.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/322698 (https://phabricator.wikimedia.org/T151086) [16:33:00] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 06Analytics-Kanban, 06DC-Ops: Kafka1022 needs a new disk - https://phabricator.wikimedia.org/T151028#2810913 (10Cmjohnson) @ottomata @elukey the disk has been replaced, you will most likely need to add it back [16:33:07] (03CR) 10BBlack: [C: 031] Remove bits.wikimedia.org apache config [puppet] - 10https://gerrit.wikimedia.org/r/322420 (owner: 10Alex Monk) [16:33:10] (03CR) 10MarcoAurelio: Remove FlaggedRevs autopromotion function at eowiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322262 (https://phabricator.wikimedia.org/T150591) (owner: 10MarcoAurelio) [16:33:56] (03PS2) 10Ori.livneh: Changes to coal-web.py should refresh the coal webapp [puppet] - 10https://gerrit.wikimedia.org/r/322432 (https://phabricator.wikimedia.org/T131820) [16:34:53] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::calico::builder: add calico-cni and k8s-policy-controller [puppet] - 10https://gerrit.wikimedia.org/r/322686 (owner: 10Giuseppe Lavagetto) [16:34:58] (03PS2) 10Giuseppe Lavagetto: profile::calico::builder: add calico-cni and k8s-policy-controller [puppet] - 10https://gerrit.wikimedia.org/r/322686 [16:35:01] (03CR) 10Giuseppe Lavagetto: [V: 032] profile::calico::builder: add calico-cni and k8s-policy-controller [puppet] - 10https://gerrit.wikimedia.org/r/322686 (owner: 10Giuseppe Lavagetto) [16:35:12] (03CR) 10Thcipriani: [C: 04-1] "beta::config is also included in role::beta::trebuchet_testing -- this role no longer seems used, but should probably be deleted here." 
[puppet] - 10https://gerrit.wikimedia.org/r/322406 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [16:35:20] (03PS4) 10Dzahn: Phab: Remove custom UI string translations not in use anymore [puppet] - 10https://gerrit.wikimedia.org/r/322144 (owner: 10Aklapper) [16:35:55] (03PS3) 10Ema: cache: remove persistent storage support [puppet] - 10https://gerrit.wikimedia.org/r/322691 (https://phabricator.wikimedia.org/T150660) [16:36:18] (03CR) 10Dzahn: [C: 031] Phab: Remove custom UI string translations not in use anymore [puppet] - 10https://gerrit.wikimedia.org/r/322144 (owner: 10Aklapper) [16:36:28] (03PS2) 10Eevans: enable instance restbase2010-c.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/322698 (https://phabricator.wikimedia.org/T151086) [16:36:49] (03CR) 10Alex Monk: "No, that class was removed in the parent commit" [puppet] - 10https://gerrit.wikimedia.org/r/322406 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [16:36:51] (03PS3) 10Giuseppe Lavagetto: docker: apt repo before installing package [puppet] - 10https://gerrit.wikimedia.org/r/321485 (owner: 10Dduvall) [16:37:03] (03CR) 10Dzahn: "sorry, i dont know about beta roles" [puppet] - 10https://gerrit.wikimedia.org/r/322406 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [16:38:18] (03CR) 10Dzahn: [C: 031] "whenever you're ready" [puppet] - 10https://gerrit.wikimedia.org/r/322698 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [16:39:06] (03CR) 10Ema: [C: 032] cache: remove persistent storage support [puppet] - 10https://gerrit.wikimedia.org/r/322691 (https://phabricator.wikimedia.org/T150660) (owner: 10Ema) [16:39:47] (03PS4) 10Hashar: contint: ElasticSearch role for build logs [puppet] - 10https://gerrit.wikimedia.org/r/322488 (https://phabricator.wikimedia.org/T78705) [16:40:06] (03CR) 10Dzahn: "did we even experience any more of those slow downs? are we spending a lot of time on something that isn't an actual problem anymore?" 
[puppet] - 10https://gerrit.wikimedia.org/r/316983 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [16:40:12] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 06Analytics-Kanban, 06DC-Ops: Kafka1022 needs a new disk - https://phabricator.wikimedia.org/T151028#2810982 (10Ottomata) k will look at it shortly. [16:40:55] (03Abandoned) 10Paladox: Gerrit: Enable concurrent collector [puppet] - 10https://gerrit.wikimedia.org/r/316983 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [16:43:00] (03CR) 10Dzahn: [C: 04-1] "-1 because of the "still needs testing in beta" comment until that is done" [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) (owner: 10Reedy) [16:48:27] (03CR) 10Hashar: [C: 04-1] "Puppet pass on the instance buildlog.integration.eqiad.wmflabs with the following hiera configuration:" [puppet] - 10https://gerrit.wikimedia.org/r/322488 (https://phabricator.wikimedia.org/T78705) (owner: 10Hashar) [16:50:20] (03CR) 10Jcrespo: [C: 032] Revert "Depool db1056 for schema change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322688 (owner: 10Jcrespo) [16:50:33] 06Operations, 10Ops-Access-Requests, 06Discovery, 06Maps, and 2 others: Requesting access to analytics-privatedata-users for technical user discovery-stats - https://phabricator.wikimedia.org/T151063#2811020 (10Gehel) I think that @MaxSem proposes to move this job to the collection of standard analytics jo... 
[16:50:57] (03Merged) 10jenkins-bot: Revert "Depool db1056 for schema change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322688 (owner: 10Jcrespo) [16:51:24] (03PS1) 10Jcrespo: Depool db1059 to apply schema change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322699 (https://phabricator.wikimedia.org/T151029) [16:53:46] (03CR) 10Eevans: "> whenever you're ready" [puppet] - 10https://gerrit.wikimedia.org/r/322698 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [16:54:04] mutante: ready when you are [16:54:52] (03PS3) 10Ori.livneh: Changes to coal-web.py should refresh the coal webapp [puppet] - 10https://gerrit.wikimedia.org/r/322432 (https://phabricator.wikimedia.org/T131820) [16:54:57] (03CR) 10Ori.livneh: [C: 032 V: 032] Changes to coal-web.py should refresh the coal webapp [puppet] - 10https://gerrit.wikimedia.org/r/322432 (https://phabricator.wikimedia.org/T131820) (owner: 10Ori.livneh) [16:54:58] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1056 T151029 (duration: 00m 59s) [16:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:18] T151029: duplicate key problems on s4 - https://phabricator.wikimedia.org/T151029 [16:55:20] (03Abandoned) 10Thcipriani: Bump scap version to 3.3.1-1 [puppet] - 10https://gerrit.wikimedia.org/r/321339 (owner: 10Thcipriani) [16:55:34] (03CR) 10Marostegui: [C: 031] Depool db1059 to apply schema change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322699 (https://phabricator.wikimedia.org/T151029) (owner: 10Jcrespo) [16:58:25] RECOVERY - puppet last run on analytics1026 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:05:45] 06Operations, 07Availability, 07Performance: Job queue size growing since ~12:00 on 2016-11-19 - https://phabricator.wikimedia.org/T151196#2811083 (10akosiaris) [17:06:13] 06Operations, 07Availability, 07Performance: Job queue size growing since ~12:00 on 2016-11-19 - 
https://phabricator.wikimedia.org/T151196#2810146 (10akosiaris) The mw208X are quite probably unrelated by the way. [17:07:30] (03PS1) 10Reedy: Update trusted-xff.php per forcepoint/websense [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322702 [17:08:50] (03PS1) 10Ema: Remove varnish::apt_preferences [puppet] - 10https://gerrit.wikimedia.org/r/322703 (https://phabricator.wikimedia.org/T150660) [17:09:11] urandom: in meeting now, we can do right afterwards [17:09:43] (03CR) 10Reedy: [C: 032] Update trusted-xff.php per forcepoint/websense [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322702 (owner: 10Reedy) [17:09:46] mutante: i'm gonna have to press pause on my readiness in a few too; I'll ping you in a bit when I'm good [17:10:22] (03Merged) 10jenkins-bot: Update trusted-xff.php per forcepoint/websense [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322702 (owner: 10Reedy) [17:10:30] mutante: or you can let fly when ready, it shouldn't need any intervention, and nothing bad should happen if it fails (other than Icinga noise) [17:11:18] 06Operations, 10ops-codfw: es2019 crashed again - https://phabricator.wikimedia.org/T149526#2811095 (10Papaul) 1- Swapped CPU 2 to CPU1 2 - Update BIOS from 2.1.6 to 2.2.5 3- Clear syslog Leaving this task open for now . [17:12:16] !log reedy@tin Synchronized wmf-config/trusted-xff.php: Update for forcepoint/websense (duration: 00m 50s) [17:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:57] 06Operations, 10ops-codfw: es2019 crashed again - https://phabricator.wikimedia.org/T149526#2811132 (10Marostegui) Thanks. I have started MySQL and replication again - we will see how it behaves! 
[17:27:51] RECOVERY - Kafka Broker Server on kafka1022 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties [17:27:52] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1022 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [17:29:19] !log unmasked kafka* on kafka1022 after disk swap [17:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:03] 06Operations, 10ops-eqiad, 10hardware-requests: Return wmf4747/wmf4748/wmf4749/wmf4750 to spares - https://phabricator.wikimedia.org/T146171#2811195 (10Cmjohnson) a:05Cmjohnson>03Joe All disks are wiped...assigning to @Joe [17:30:55] (03CR) 10Ottomata: "Discussed and explained a bit more in IRC. Alex and I both agree that these names are confusing. When we get around to replacing the ana" [puppet] - 10https://gerrit.wikimedia.org/r/320690 (https://phabricator.wikimedia.org/T143925) (owner: 10Ottomata) [17:31:10] (03PS6) 10Ottomata: Deploy EventStreams on scb and configure LVS service in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/320690 (https://phabricator.wikimedia.org/T143925) [17:35:06] PROBLEM - puppet last run on ms-be1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:46:39] hello [17:46:51] i have a question [17:48:24] well, go ahead [17:50:51] 06Operations, 10ops-eqiad, 10DBA, 06Labs, and 3 others: Move dbproxy1010 and dbproxy1011 to labs-support network, rename them to labsdbproxy1001 and labsdbproxy1002 - https://phabricator.wikimedia.org/T149170#2811283 (10Cmjohnson) I can move these to rack c5. Can these be moved anytime or do they need to... [17:54:18] do i have to follow a specific procedure to upload a licensed image on a wikipedia page? 
I received an email confirmation from the publisher that authorizes us, but we don't know if there is a specific procedure to follow [17:57:07] luckyjane_: please do this: https://commons.wikimedia.org/wiki/Commons:OTRS#If_you_are_NOT_the_copyright_holder [17:57:19] 06Operations, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2811346 (10Joe) [17:57:37] luckyjane_: out of curiosity, how did you make it to this IRC channel? is it mentioned in some documentation about this? it's not really the best place [17:59:56] mutante: ready (again). [18:00:10] 06Operations, 10Ops-Access-Requests, 06Discovery, 06Maps, and 2 others: Requesting access to analytics-privatedata-users for technical user discovery-stats - https://phabricator.wikimedia.org/T151063#2806341 (10Ottomata) Yeah, I wish we could do this! Due to a constraint in the way the puppet admin module... [18:01:14] i just searched for wikipedia irc channels and didn't really know which one to join [18:01:19] thank you [18:04:05] RECOVERY - puppet last run on ms-be1011 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [18:16:37] 06Operations, 10ops-codfw, 10DBA: db2041: Disk RAID predictive failure - https://phabricator.wikimedia.org/T151203#2811422 (10Papaul) a:05Papaul>03Marostegui Disk replacement complete. [18:17:24] (03PS1) 10Ori.livneh: Add X-Wikimedia-Debug config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322708 [18:17:39] bd808: ^ [18:17:58] generating it from puppet would be ideal, but too complicated to be practical [18:19:52] 06Operations, 10ops-eqiad, 10DBA, 06Labs, and 3 others: Move dbproxy1010 and dbproxy1011 to labs-support network, rename them to labsdbproxy1001 and labsdbproxy1002 - https://phabricator.wikimedia.org/T149170#2811430 (10jcrespo) They can be moved at any time, I will schedule downtime now on icinga. 
[18:26:11] 06Operations, 10Ops-Access-Requests, 06Discovery, 06Maps, and 2 others: Requesting access to analytics-privatedata-users for technical user discovery-stats - https://phabricator.wikimedia.org/T151063#2811449 (10MaxSem) My current plan is to redo this with standard Analytics tools, which have the required a... [18:31:10] 06Operations, 10ops-codfw, 10DBA: db2041: Disk RAID predictive failure - https://phabricator.wikimedia.org/T151203#2811460 (10Papaul) a:05Marostegui>03Papaul Wrong task the disk replacement was for db2035 not db2041. Discard this for now. [18:32:35] PROBLEM - puppet last run on mw1286 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:32:55] 06Operations, 10ops-codfw, 10DBA: db2035: RAID disk about to fail - https://phabricator.wikimedia.org/T150511#2811467 (10Papaul) Disk replacement complete. [18:35:28] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2811471 (10Gilles) If you're looking for hardware acceleration/battery saving, JPG is probably your best bet. If you want the most lightweight file, th... [18:35:38] (03CR) 10Krinkle: [C: 031] Remove bits.wikimedia.org apache config [puppet] - 10https://gerrit.wikimedia.org/r/322420 (owner: 10Alex Monk) [18:35:56] (03PS2) 10Krinkle: Remove bits.wikimedia.org apache config [puppet] - 10https://gerrit.wikimedia.org/r/322420 (https://phabricator.wikimedia.org/T107430) (owner: 10Alex Monk) [18:43:15] PROBLEM - puppet last run on dbstore1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:44:22] 06Operations, 10Ops-Access-Requests, 06Discovery, 06Maps, and 2 others: Requesting access to analytics-privatedata-users for technical user discovery-stats - https://phabricator.wikimedia.org/T151063#2811510 (10Ottomata) Ah, so you don't need stat1002 (specifically) or Hadoop. 
You just need access to the... [18:49:12] (03PS7) 10Ottomata: Deploy EventStreams on scb and configure LVS service in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/320690 (https://phabricator.wikimedia.org/T143925) [18:52:14] (03CR) 10Ottomata: [C: 032] Deploy EventStreams on scb and configure LVS service in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/320690 (https://phabricator.wikimedia.org/T143925) (owner: 10Ottomata) [18:56:57] PROBLEM - puppet last run on scb1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Package[eventstreams/deploy] [18:58:37] PROBLEM - puppet last run on scb1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[eventstreams/deploy] [18:59:07] PROBLEM - puppet last run on scb1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[eventstreams/deploy] [18:59:07] PROBLEM - puppet last run on scb1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[eventstreams/deploy] [19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161121T1900). Please do the needful. [19:00:05] thedj and James F: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [19:00:12] * James_F waves. [19:00:17] PROBLEM - puppet last run on scb2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[eventstreams/deploy] [19:00:27] PROBLEM - eventstreams on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 8092: Connection refused [19:00:37] RECOVERY - puppet last run on mw1286 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [19:00:47] ottomata: seems like puppetfails above for scb might be your change? [19:01:13] bblack: yeah [19:01:26] I can SWAT today [19:01:28] this repo init thing doesn't seem very smooth, but there is also a typo in my scap config [19:01:29] fixing [19:01:57] thedj: ping for SWAT [19:02:12] PROBLEM - LVS HTTP IPv4 on eventstreams.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.34 and port 8092: Connection refused [19:02:38] that's me too [19:02:39] wow paged? [19:02:40] sorry [19:02:41] this one just paged... anyone looking? [19:02:47] ok ottomata :) [19:02:49] that shouldn't page (yet) at all. [19:03:00] will fix that too [19:03:05] thx [19:05:08] PROBLEM - puppet last run on scb2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[eventstreams/deploy] [19:05:24] thcipriani: pong [19:05:34] thedj: hi :) [19:07:02] * thcipriani waits on jenkins [19:07:07] PROBLEM - puppet last run on scb2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[eventstreams/deploy] [19:07:47] PROBLEM - puppet last run on scb2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[eventstreams/deploy] [19:10:32] !log testing [19:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:07] RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [19:11:57] RECOVERY - puppet last run on scb1001 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [19:13:37] RECOVERY - puppet last run on scb1004 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [19:14:07] RECOVERY - puppet last run on scb1002 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [19:14:07] RECOVERY - puppet last run on scb1003 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [19:17:22] you guys got paged on eventstreams? I didn't .... [19:17:42] hmmm [19:17:48] yeah it paged [19:17:54] bblack: yep, got paged [19:18:26] so both EU and US phones I would say, godog you moved to the other timeslot right? 
[19:18:52] thedj: your change is live on mw1099, check please [19:19:09] thcipriani: checking [19:19:42] I'm on 24/7 last I checked anyways, I just mute if I want to be in sleep-mode [19:20:00] volans: yeah I'm in PDT working hours now [19:21:26] thcipriani: looks good [19:21:49] thedj: ok, going live everywhere [19:22:47] 06Operations, 06Operations-Software-Development, 13Patch-For-Review, 06Services (watching): More verbose messages from service-checker-swagger - https://phabricator.wikimedia.org/T150560#2811632 (10Volans) a:03Volans [19:23:51] 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring: Deploy federation for Prometheus - https://phabricator.wikimedia.org/T150486#2811638 (10fgiunchedi) a:03fgiunchedi [19:23:56] (03PS1) 10Andrew Bogott: Add tools project filter tags to many toollabs roles [puppet] - 10https://gerrit.wikimedia.org/r/322716 [19:24:10] !log thcipriani@tin Synchronized php-1.29.0-wmf.3/extensions/MobileFrontend/resources/skins.minerva.content.styles/hacks.less: SWAT: [[gerrit:322638|Correct flex display for thumbnail contents on mobile (T150706)]] (duration: 00m 59s) [19:24:16] ^ thedj live now [19:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:32] T150706: Links in image captions are mangled - https://phabricator.wikimedia.org/T150706 [19:26:22] (03PS2) 10Andrew Bogott: Add tools project filter tags to many toollabs roles [puppet] - 10https://gerrit.wikimedia.org/r/322716 [19:26:24] (03PS1) 10Andrew Bogott: Shorten tab name for the puppet prefix tab [puppet] - 10https://gerrit.wikimedia.org/r/322717 [19:26:36] thcipriani: thx [19:26:36] James_F: your change is live on mw1099 [19:27:04] Cool. [19:27:08] thedj: you're welcome :) [19:27:37] thcipriani: Yup, works. [19:27:45] James_F: ok, fine to do a sync-dir here? [19:27:52] Yes. [19:28:08] cool, going live. 
[19:28:53] !log swift eqiad-prod ms-be1027 to weight 250 - T136631 [19:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:14] T136631: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631 [19:29:20] PROBLEM - eventstreams on scb2002 is CRITICAL: connect to address 10.192.48.43 and port 8092: Connection refused [19:30:00] PROBLEM - eventstreams on scb2003 is CRITICAL: connect to address 10.192.0.33 and port 8092: Connection refused [19:30:06] !log thcipriani@tin Synchronized php-1.29.0-wmf.3/extensions/VisualEditor: SWAT: [[gerrit:322693|Update VE core submodule to wmf/1.29.0-wmf.3 HEAD (68a1d94)]] (T151005) (duration: 00m 49s) [19:30:11] ^ James_F live everywhere [19:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:25] T151005: [Regression wmf.3] None of the changes made to the Page Settings are getting saved, all the changes are getting reverted to Default Settings - https://phabricator.wikimedia.org/T151005 [19:30:30] PROBLEM - eventstreams on scb1002 is CRITICAL: HTTP CRITICAL - No data received from host [19:30:30] PROBLEM - eventstreams on scb2004 is CRITICAL: connect to address 10.192.16.36 and port 8092: Connection refused [19:30:30] Cool, double-checking. 
[19:30:31] (03CR) 10Andrew Bogott: [C: 032] Shorten tab name for the puppet prefix tab [puppet] - 10https://gerrit.wikimedia.org/r/322717 (owner: 10Andrew Bogott) [19:30:44] (03CR) 10Andrew Bogott: [C: 032] Add tools project filter tags to many toollabs roles [puppet] - 10https://gerrit.wikimedia.org/r/322716 (owner: 10Andrew Bogott) [19:31:20] PROBLEM - eventstreams on scb1003 is CRITICAL: HTTP CRITICAL - No data received from host [19:32:00] PROBLEM - eventstreams on scb1004 is CRITICAL: HTTP CRITICAL - No data received from host [19:32:40] PROBLEM - eventstreams on scb2001 is CRITICAL: connect to address 10.192.32.132 and port 8092: Connection refused [19:33:15] (03CR) 10Ottomata: [C: 031] Refactor the parsing functions out of the main C file [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/322257 (https://phabricator.wikimedia.org/T147440) (owner: 10Elukey) [19:34:50] thcipriani: Yup, confirmed working in production. Thanks. [19:34:59] (03PS4) 10Filippo Giunchedi: Remove role::beta::uploadservice [puppet] - 10https://gerrit.wikimedia.org/r/322403 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [19:35:00] Awesome, thanks for checking (also liked the way you did the sub-submodule update, made SWAT way easier) [19:35:57] I'm so bad at flailing at submodules in a time crunch :) [19:36:46] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup restbase201[0-2] - https://phabricator.wikimedia.org/T150680#2811709 (10fgiunchedi) [19:37:15] 06Operations, 10ops-eqiad: eqiad: Rack and setup new restbase nodes - https://phabricator.wikimedia.org/T150964#2802840 (10fgiunchedi) [19:37:18] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup restbase201[0-2] - https://phabricator.wikimedia.org/T150680#2793238 (10fgiunchedi) [19:38:57] (03CR) 10Filippo Giunchedi: [C: 032] Remove role::beta::uploadservice [puppet] - 10https://gerrit.wikimedia.org/r/322403 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [19:39:34] (03PS3) 
10Filippo Giunchedi: enable instance restbase2010-c.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/322698 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [19:39:54] (03CR) 10Krinkle: [C: 031] Add X-Wikimedia-Debug config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322708 (owner: 10Ori.livneh) [19:41:10] (03CR) 10Filippo Giunchedi: [C: 032] enable instance restbase2010-c.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/322698 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [19:41:36] urandom: ^ [19:42:07] godog: thank you sir! [19:42:14] (03PS1) 10Ottomata: Add eventstreams to scb node conftool configuration [puppet] - 10https://gerrit.wikimedia.org/r/322721 (https://phabricator.wikimedia.org/T143925) [19:42:35] you are very welcome [19:45:16] (03PS2) 10Ottomata: Add eventstreams to scb node conftool configuration [puppet] - 10https://gerrit.wikimedia.org/r/322721 (https://phabricator.wikimedia.org/T143925) [19:45:20] (03CR) 10Ottomata: [C: 032 V: 032] Add eventstreams to scb node conftool configuration [puppet] - 10https://gerrit.wikimedia.org/r/322721 (https://phabricator.wikimedia.org/T143925) (owner: 10Ottomata) [19:47:31] thcipriani: Yeah, with VE being a special-snowflake I have pre-scripted sub-module creation command strings in a scratch pad here. :-) [19:48:58] :) [19:55:34] 06Operations, 10Ops-Access-Requests, 06Discovery, 06Maps, and 2 others: Requesting access to analytics-privatedata-users for technical user discovery-stats - https://phabricator.wikimedia.org/T151063#2811787 (10Gehel) 05Open>03Resolved a:03Gehel I'm closing this after discussion with @MaxSem. We are... [19:56:00] (03PS1) 10Andrew Bogott: role::prometheus::tools is used in the tools project. [puppet] - 10https://gerrit.wikimedia.org/r/322723 [19:57:13] (03CR) 10Andrew Bogott: [C: 032] role::prometheus::tools is used in the tools project. 
[puppet] - 10https://gerrit.wikimedia.org/r/322723 (owner: 10Andrew Bogott) [19:57:26] RECOVERY - puppet last run on scb2001 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [20:00:06] PROBLEM - cassandra-c CQL 10.192.16.188:9042 on restbase2010 is CRITICAL: connect to address 10.192.16.188 and port 9042: Connection refused [20:00:21] ^^^ got that [20:00:59] ACKNOWLEDGEMENT - cassandra-c CQL 10.192.16.188:9042 on restbase2010 is CRITICAL: connect to address 10.192.16.188 and port 9042: Connection refused eevans Bootstrapping [20:03:16] RECOVERY - puppet last run on scb2004 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [20:04:10] (03PS1) 10Filippo Giunchedi: prometheus: logrotate only server.log [puppet] - 10https://gerrit.wikimedia.org/r/322724 (https://phabricator.wikimedia.org/T151149) [20:05:16] RECOVERY - puppet last run on scb2003 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [20:05:45] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: logrotate only server.log [puppet] - 10https://gerrit.wikimedia.org/r/322724 (https://phabricator.wikimedia.org/T151149) (owner: 10Filippo Giunchedi) [20:05:46] RECOVERY - puppet last run on scb2002 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [20:07:16] PROBLEM - Check systemd state on scb2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:07:26] PROBLEM - Check systemd state on scb2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:11:02] akosiaris: \o/ it works [20:12:24] :-) [20:12:52] 06Operations, 13Patch-For-Review: Prometheus cronspam - https://phabricator.wikimedia.org/T151149#2811857 (10fgiunchedi) a:03fgiunchedi Should be fixed by the reviews above, waiting tomorrow to confirm, thanks! 
[20:16:27] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 385 bytes in 0.005 second response time [20:16:57] ^checking [20:18:12] (03PS1) 10Ottomata: Add eventstreams to list of lvs realserver ips for scb [puppet] - 10https://gerrit.wikimedia.org/r/322726 (https://phabricator.wikimedia.org/T143925) [20:18:26] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3670 bytes in 0.056 second response time [20:20:40] (03CR) 10Alexandros Kosiaris: [C: 031] Add eventstreams to list of lvs realserver ips for scb [puppet] - 10https://gerrit.wikimedia.org/r/322726 (https://phabricator.wikimedia.org/T143925) (owner: 10Ottomata) [20:21:11] (03CR) 10Ottomata: [C: 032] Add eventstreams to list of lvs realserver ips for scb [puppet] - 10https://gerrit.wikimedia.org/r/322726 (https://phabricator.wikimedia.org/T143925) (owner: 10Ottomata) [20:26:50] (03PS7) 10Filippo Giunchedi: role: add prometheus 'global' instance [puppet] - 10https://gerrit.wikimedia.org/r/321814 (https://phabricator.wikimedia.org/T150486) [20:27:26] (03CR) 10Filippo Giunchedi: "Restrict to codfw for now, eqiad has less space available" [puppet] - 10https://gerrit.wikimedia.org/r/321814 (https://phabricator.wikimedia.org/T150486) (owner: 10Filippo Giunchedi) [20:33:43] (03PS1) 10Andrew Bogott: Labs enc: Don't use ldap as a source for anything [puppet] - 10https://gerrit.wikimedia.org/r/322730 (https://phabricator.wikimedia.org/T148683) [20:36:26] PROBLEM - puppet last run on ganeti1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:40:46] (03PS1) 10Ottomata: Allow lvs service monitoring to specify critical parameter for monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/322732 (https://phabricator.wikimedia.org/T143925) [20:41:41] (03PS2) 10Andrew Bogott: Labs enc: Don't use ldap as a source for anything [puppet] - 10https://gerrit.wikimedia.org/r/322730 (https://phabricator.wikimedia.org/T148683) [20:42:06] (03PS2) 10Ottomata: Allow lvs service monitoring to specify critical parameter for monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/322732 (https://phabricator.wikimedia.org/T143925) [20:46:39] (03PS5) 10Dzahn: Phab: Remove custom UI string translations not in use anymore [puppet] - 10https://gerrit.wikimedia.org/r/322144 (owner: 10Aklapper) [20:46:54] (03CR) 10Dzahn: [C: 032] Phab: Remove custom UI string translations not in use anymore [puppet] - 10https://gerrit.wikimedia.org/r/322144 (owner: 10Aklapper) [20:48:36] (03CR) 10Smalyshev: "I was under impression nobody should ever cache response with code like 429, by definition, but I'll check if it's the real state of affai" [puppet] - 10https://gerrit.wikimedia.org/r/319010 (https://phabricator.wikimedia.org/T108488) (owner: 10Smalyshev) [20:50:47] removed reference to T33 [20:50:47] T33: Phabricator should let you add dependencies both ways (depending and blocking) - https://phabricator.wikimedia.org/T33 [20:56:06] PROBLEM - puppet last run on elastic1020 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:57:07] 06Operations, 06Security-Team, 13Patch-For-Review: Create cronjob for regular captcha regeneration - https://phabricator.wikimedia.org/T150029#2812076 (10Reedy) [20:59:48] 06Operations, 06Security-Team, 13Patch-For-Review: Create cronjob for regular captcha regeneration - https://phabricator.wikimedia.org/T150029#2812082 (10Reedy) T151244 and https://gerrit.wikimedia.org/r/322735 for adding a --delete option to delete all the current catpchas after putting new ones in place [21:00:59] (03CR) 10Reedy: "T151244 and therefore the patch in https://gerrit.wikimedia.org/r/322735 block this now ;D" [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) (owner: 10Reedy) [21:05:24] RECOVERY - puppet last run on ganeti1003 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [21:07:10] (03PS3) 10Andrew Bogott: Labs enc: Don't use ldap as a source for anything [puppet] - 10https://gerrit.wikimedia.org/r/322730 (https://phabricator.wikimedia.org/T148683) [21:07:12] (03PS1) 10Andrew Bogott: puppet-enc: Sort role list [puppet] - 10https://gerrit.wikimedia.org/r/322736 [21:08:45] !log starting mobileapps deploy [21:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:51] !log deployed mobileapps da269c3 [21:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:56] (03CR) 10Ori.livneh: [C: 032] Add X-Wikimedia-Debug config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322708 (owner: 10Ori.livneh) [21:13:33] (03Merged) 10jenkins-bot: Add X-Wikimedia-Debug config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322708 (owner: 10Ori.livneh) [21:14:40] (03CR) 10Andrew Bogott: [C: 032] puppet-enc: Sort role list [puppet] - 10https://gerrit.wikimedia.org/r/322736 (owner: 10Andrew Bogott) [21:15:10] ori: Is there somewhere we can leave a comment in puppet to remind people to 
update that file in mw-config too? [21:15:14] PROBLEM - puppet last run on db1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:52] Reedy: yes, and I will [21:15:57] <3 [21:17:32] !log ori@tin Synchronized debug.json: (no message) (duration: 01m 06s) [21:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:28] !log ori@tin Synchronized docroot/noc: (no message) (duration: 00m 55s) [21:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:04] RECOVERY - puppet last run on elastic1020 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [21:34:38] (03PS3) 10Ottomata: Allow lvs service monitoring to specify critical parameter for monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/322732 (https://phabricator.wikimedia.org/T143925) [21:44:14] RECOVERY - puppet last run on db1033 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [21:50:04] PROBLEM - puppet last run on ms-be3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:51:14] (03CR) 10Yuvipanda: [C: 031] Labs enc: Don't use ldap as a source for anything [puppet] - 10https://gerrit.wikimedia.org/r/322730 (https://phabricator.wikimedia.org/T148683) (owner: 10Andrew Bogott) [21:51:40] (03CR) 10Andrew Bogott: [C: 032] Labs enc: Don't use ldap as a source for anything [puppet] - 10https://gerrit.wikimedia.org/r/322730 (https://phabricator.wikimedia.org/T148683) (owner: 10Andrew Bogott) [21:59:04] PROBLEM - puppet last run on db1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:04:24] RECOVERY - Check systemd state on scb2003 is OK: OK - running: The system is fully operational [22:07:24] PROBLEM - Check systemd state on scb2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[22:11:48] 06Operations, 10ops-eqiad, 10media-storage, 13Patch-For-Review: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631#2812478 (10Cmjohnson) [22:11:50] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed disks on ms-be1027 - https://phabricator.wikimedia.org/T140374#2812477 (10Cmjohnson) 05Open>03Resolved [22:13:07] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 06Analytics-Kanban, 06DC-Ops: Kafka1022 needs a new disk - https://phabricator.wikimedia.org/T151028#2812484 (10Cmjohnson) 05Open>03Resolved [22:15:04] PROBLEM - Juniper alarms on asw-ulsfo.mgmt.ulsfo.wmnet is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms [22:17:04] PROBLEM - Host ripe-atlas-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [22:17:24] PROBLEM - puppet last run on bast3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:18:04] RECOVERY - puppet last run on ms-be3002 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [22:27:04] RECOVERY - puppet last run on db1043 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [22:31:48] (03PS1) 10RobH: depooling ulsfo [dns] - 10https://gerrit.wikimedia.org/r/322778 [22:33:09] (03PS2) 10RobH: depooling ulsfo [dns] - 10https://gerrit.wikimedia.org/r/322778 [22:33:38] (03CR) 10BBlack: [C: 031] depooling ulsfo [dns] - 10https://gerrit.wikimedia.org/r/322778 (owner: 10RobH) [22:33:44] (03CR) 10RobH: [C: 032] depooling ulsfo [dns] - 10https://gerrit.wikimedia.org/r/322778 (owner: 10RobH) [22:36:44] PROBLEM - puppet last run on logstash1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:36:49] !log depooled ulsfo, unitedlayer has to do an emergency replacement of a failed pdu [22:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:29] 06Operations, 10ops-ulsfo: ulsfo pdu 1.22 replacement - https://phabricator.wikimedia.org/T151263#2812572 (10RobH) [22:46:24] RECOVERY - puppet last run on bast3001 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [22:49:44] PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:50:45] (03PS1) 1020after4: Allow aklapper to `sudo -E` phabricator admin utilities [puppet] - 10https://gerrit.wikimedia.org/r/322781 (https://phabricator.wikimedia.org/T151148) [23:02:17] PROBLEM - LVS HTTP IPv4 on eventstreams.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.34 and port 8092: Connection refused [23:02:34] wah waaaah [23:02:47] again? [23:02:59] I'll silence it, I guess downtime expired [23:03:02] i bet he didn't sticky the ack so it flaps and alerts again? [23:03:11] 4h [23:03:17] default downtime I bet [23:03:54] luckily it's late enough that it didn't page europe :-P [23:04:21] hehe alright silenced until dec 12th (monday) [23:04:26] ottomata: ^ [23:04:44] RECOVERY - puppet last run on logstash1004 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [23:08:36] (03PS8) 10Filippo Giunchedi: role: add prometheus 'global' instance [puppet] - 10https://gerrit.wikimedia.org/r/321814 (https://phabricator.wikimedia.org/T150486) [23:09:12] hey thanks godog yeah, I made this https://gerrit.wikimedia.org/r/#/c/322732/ [23:09:24] didn't want to merge on my eve [23:10:33] ah, that makes sense ottomata [23:10:44] PROBLEM - puppet last run on analytics1049 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:16:02] (03CR) 10Filippo Giunchedi: [C: 032] role: add prometheus 'global' instance [puppet] - 10https://gerrit.wikimedia.org/r/321814 (https://phabricator.wikimedia.org/T150486) (owner: 10Filippo Giunchedi) [23:17:07] (03PS6) 10Reedy: Add cronjob for regenerating captchas [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) [23:17:26] (03CR) 10Reedy: "PS6 adds --delete (blocked on the train deploying .4 now)" [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) (owner: 10Reedy) [23:17:31] (03PS7) 10Reedy: Add cronjob for regenerating captchas [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) [23:19:44] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [23:20:58] (03PS1) 10Filippo Giunchedi: role: fix typo in prometheus global instance [puppet] - 10https://gerrit.wikimedia.org/r/322789 [23:24:36] (03CR) 10Filippo Giunchedi: [C: 032] role: fix typo in prometheus global instance [puppet] - 10https://gerrit.wikimedia.org/r/322789 (owner: 10Filippo Giunchedi) [23:28:44] (03PS1) 1020after4: Phabricator: Unbreak incoming email and harden config file permissions. [puppet] - 10https://gerrit.wikimedia.org/r/322791 (https://phabricator.wikimedia.org/T151229) [23:29:14] PROBLEM - High lag on wdqs1002 is CRITICAL: CRITICAL: 31.25% of data above the critical threshold [1800.0] [23:29:16] Any opsen available to look at https://gerrit.wikimedia.org/r/322791 ? [23:29:39] phabricator incoming email is broken and that should fix it. I think that makes it somewhat urgent. [23:30:10] (03CR) 10Paladox: [C: 031] Phabricator: Unbreak incoming email and harden config file permissions. 
[puppet] - 10https://gerrit.wikimedia.org/r/322791 (https://phabricator.wikimedia.org/T151229) (owner: 1020after4) [23:30:13] (03CR) 10jenkins-bot: [V: 04-1] Phabricator: Unbreak incoming email and harden config file permissions. [puppet] - 10https://gerrit.wikimedia.org/r/322791 (https://phabricator.wikimedia.org/T151229) (owner: 1020after4) [23:31:03] (03CR) 10Paladox: Phabricator: Unbreak incoming email and harden config file permissions. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/322791 (https://phabricator.wikimedia.org/T151229) (owner: 1020after4) [23:31:08] twentyafterfour: jenkins says no :P [23:31:14] LOL [23:31:14] RECOVERY - High lag on wdqs1002 is OK: OK: Less than 30.00% above the threshold [600.0] [23:31:15] I think you just missed duty opsen too [23:31:56] (03PS1) 10Gergő Tisza: Deploy EmailAuth to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322792 (https://phabricator.wikimedia.org/T151015) [23:32:49] (03CR) 10Reedy: [C: 031] "Looks good to go for now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322792 (https://phabricator.wikimedia.org/T151015) (owner: 10Gergő Tisza) [23:33:36] (03PS2) 1020after4: Phabricator: Unbreak incoming email and harden config file permissions. [puppet] - 10https://gerrit.wikimedia.org/r/322791 (https://phabricator.wikimedia.org/T151229) [23:33:42] fixed.. [23:33:42] twentyafterfour see if any us ops are willing to do it, or if you are up when the eu ops are around see if they will do it, the ops clinic duty person should be online by then. [23:34:04] or add it to puppet swat if no ops are available :) [23:34:11] https://gerrit.wikimedia.org/r/322791 [23:34:11] doesn't godog use the inbound email? :P [23:34:14] mutante: ^ [23:34:18] (03CR) 10Paladox: [C: 031] Phabricator: Unbreak incoming email and harden config file permissions. 
[puppet] - 10https://gerrit.wikimedia.org/r/322791 (https://phabricator.wikimedia.org/T151229) (owner: 1020after4) [23:34:43] problem is that I think it just silently fails so it can easily go unnoticed [23:36:04] RECOVERY - Juniper alarms on asw-ulsfo.mgmt.ulsfo.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [23:36:12] twentyafterfour will that be broken on phab-01? [23:36:39] as it doesn't use the puppet class you fixed it in. https://gerrit.wikimedia.org/r/322791 [23:36:54] RECOVERY - Host ripe-atlas-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 78.68 ms [23:37:47] phab-01 doesn't support incoming email does it? [23:37:55] Yes it should [23:38:03] it should be able to send emails out. [23:38:04] so I live-hacked the fix on iridium, and it works...but puppet will undo my fix so ... [23:38:17] twentyafterfour disable puppet [23:38:24] !log disabling puppet on iridium until https://gerrit.wikimedia.org/r/#/c/322791/ lands [23:38:31] Thanks :) [23:38:44] RECOVERY - puppet last run on analytics1049 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [23:38:44] It seems the log bot is not working [23:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:54] Oh that took a long time to log [23:38:58] never mind [23:39:34] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 416 probes of 430 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [23:39:36] Reedy: ah? which inbound email? :) [23:39:48] godog: Don't you use email to phab? 
[23:41:56] it's no longer as urgent - I live-hacked the fix on iridium and disabled puppet [23:42:06] godog: ^ [23:42:47] (03CR) 1020after4: "this is live-hacked on iridium, and puppet is disabled until this lands" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/322791 (https://phabricator.wikimedia.org/T151229) (owner: 1020after4) [23:43:09] (03PS1) 10Filippo Giunchedi: role: fix Prometheus global instance configuration [puppet] - 10https://gerrit.wikimedia.org/r/322798 [23:43:24] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 257 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [23:43:30] Reedy: ah, yeah that's right I used it to file automatic tasks [23:43:41] twentyafterfour: ok I'll take a look shortly [23:44:07] Yay, interested peoples :D [23:44:23] twentyafterfour: this probably needs an email to ops@ / wikitech-l after it's fixed so people know they might have yelled into the void :) [23:44:34] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 3 probes of 430 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [23:44:36] indeed [23:45:06] did it just discard inbound email? [23:45:20] (03CR) 10Filippo Giunchedi: [C: 032] role: fix Prometheus global instance configuration [puppet] - 10https://gerrit.wikimedia.org/r/322798 (owner: 10Filippo Giunchedi) [23:48:14] PROBLEM - Check whether ferm is active by checking the default input chain on db1092 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:48:14] PROBLEM - Check size of conntrack table on db1092 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:48:14] PROBLEM - configured eth on db1092 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:48:27] PROBLEM - MariaDB Slave IO: s5 on db1092 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:48:27] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 9 probes of 257 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [23:48:34] PROBLEM - dhclient process on db1092 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:48:34] PROBLEM - Disk space on db1092 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:48:37] PROBLEM - MariaDB Slave SQL: s5 on db1092 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:48:43] wah waaaaaah [23:48:44] PROBLEM - puppet last run on db1092 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:48:52] eww [23:48:56] PROBLEM - mysqld processes on db1092 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:48:56] PROBLEM - DPKG on db1092 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:48:59] PROBLEM - MariaDB disk space on db1092 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:48:59] PROBLEM - salt-minion processes on db1092 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:49:01] crashed I suspect [23:49:04] PROBLEM - Check systemd state on db1092 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:49:09] yeah [23:49:13] its a slave in s5 [23:49:21] I'm here but I'm not (sick and therefore awake cause... stuff) [23:49:27] im logging into the mgmt [23:50:23] yeah, it appears hard crashed in console [23:50:27] no response via serial [23:50:28] I suppose depool it? 
[23:50:29] robh: ok, I've silenced it for 2h, let me know how it goes [23:50:44] (03PS1) 10BryanDavis: webservice: guard against PYTHONPATH munging in caller's environment [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/322799 (https://phabricator.wikimedia.org/T147350) [23:51:05] !log db1095 alerted icinga, non-responsive to serial console (hard crash), rebooting [23:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:53] apergos: yeah makes sense [23:52:05] damn these systems take a long time to post [23:52:34] so according to https://dbtree.wikimedia.org/ nothing is syncing off it [23:52:41] I'm not sure you really have to depool anything [23:52:41] robh: 1092 not 1095 ;) [23:53:02] heh load was through the roof before the crash https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?var-server=db1092:9100&var-datasource=eqiad%20prometheus%2Fops&from=1479771511437&to=1479772319894 [23:53:05] !log db1092, typo in my log! [23:53:20] Just comment it out in db-eqiad? 
[23:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:44] RECOVERY - salt-minion processes on db1092 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:53:44] RECOVERY - DPKG on db1092 is OK: All packages OK [23:53:47] RECOVERY - MariaDB disk space on db1092 is OK: DISK OK [23:53:47] well, its back online now [23:53:54] but i dont think mysql starts automatically [23:53:54] RECOVERY - Check systemd state on db1092 is OK: OK - running: The system is fully operational [23:54:14] RECOVERY - Check whether ferm is active by checking the default input chain on db1092 is OK: OK ferm input default policy is set [23:54:14] RECOVERY - Check size of conntrack table on db1092 is OK: OK: nf_conntrack is 0 % full [23:54:14] RECOVERY - configured eth on db1092 is OK: OK - interfaces up [23:54:24] RECOVERY - dhclient process on db1092 is OK: PROCS OK: 0 processes with command name dhclient [23:54:24] RECOVERY - Disk space on db1092 is OK: DISK OK [23:54:34] RECOVERY - puppet last run on db1092 is OK: OK: Puppet is currently enabled, last run 18 minutes ago with 0 failures [23:55:10] robh: you don't want queries to go to that db, in case it's got corruption, very possible after a crash [23:55:18] ok [23:55:29] im making a task for followup so the DBAs are aware [23:55:39] what reedy says, remove (comment out) in db-eqiad [23:55:43] cool [23:57:18] email sent about inbound email outage. does this need a more formal incident report? [23:57:33] I don't actually know all of the details... [23:57:56] probably not many people use email replies to interact with phabricator? 
I'm not actually sure how used that feature is [23:58:26] (03PS1) 10Reedy: Comment out db1092 after crash till dba have looked at box [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322800 [23:58:40] (03PS1) 10RobH: db1092 crashed and was offline for a bit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322801 (https://phabricator.wikimedia.org/T151272) [23:58:45] oh [23:58:49] Reedy: you beat me to it ;] [23:58:52] twentyafterfour, I once used it on mobile cuz my password is a pain to type [23:58:56] by 14 seconds :D [23:59:12] reedy is fast.. [23:59:13] i'll abandon mine if you can tie yours to the task I'm appending to? [23:59:17] twentyafterfour: yeah an incident report would be nice, at least to get some action items to avoid silent failure [23:59:23] but which one will jenkins verify first? [23:59:40] might as well mention the task on the line too as well as the commit message [23:59:47] 06Operations, 10DBA, 13Patch-For-Review: db1092 crash - https://phabricator.wikimedia.org/T151272#2812814 (10RobH)