[00:02:55] (03CR) 10Chad: [C: 04-1] "Oh, hmm. For deploying with scap itself. Hmmm. Well we'd do it like we already do with id_rsa/id_rsa.pub in jetty.pp. We should be able to" [puppet] - 10https://gerrit.wikimedia.org/r/363726 (owner: 10Paladox) [00:10:04] (03Draft1) 10Paladox: Gerrit: Add gerrit pub key for ssh [labs/private] - 10https://gerrit.wikimedia.org/r/363755 [00:10:06] (03PS2) 10Paladox: Gerrit: Add gerrit pub key for ssh [labs/private] - 10https://gerrit.wikimedia.org/r/363755 [00:10:50] (03PS6) 10Paladox: WIP: Gerrit: Add support for scap [puppet] - 10https://gerrit.wikimedia.org/r/363726 [00:14:01] 10Operations: postgresql::ganglia on puppetdb servers - authentication failed - https://phabricator.wikimedia.org/T169953#3414337 (10Dzahn) [00:18:57] (03CR) 10Dzahn: icinga/role:mail::mx: add monitoring of exim queue size (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) (owner: 10Dzahn) [00:21:30] (03CR) 10Dzahn: "the production private key has "content => secret('gerrit/id_rsa')," so it comes from private repo. imho the answer is to make a new key a" [puppet] - 10https://gerrit.wikimedia.org/r/363726 (owner: 10Paladox) [00:22:42] (03CR) 10Dzahn: "the ".pub" part goes in the public repo and the private part in labs/private" [labs/private] - 10https://gerrit.wikimedia.org/r/363755 (owner: 10Paladox) [00:24:43] (03CR) 10Dzahn: [C: 04-1] "i would say:" [labs/private] - 10https://gerrit.wikimedia.org/r/363755 (owner: 10Paladox) [00:27:27] (03PS3) 10Paladox: Gerrit: Add gerrit pub key for ssh [labs/private] - 10https://gerrit.wikimedia.org/r/363755 [00:30:22] (03PS7) 10Paladox: WIP: Gerrit: Add support for scap [puppet] - 10https://gerrit.wikimedia.org/r/363726 [00:34:23] (03CR) 10Chad: WIP: Gerrit: Add support for scap (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/363726 (owner: 10Paladox) [00:35:24] (03CR) 10Paladox: WIP: Gerrit: Add support for scap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363726 (owner: 10Paladox) [00:35:37] (03PS8) 10Paladox: WIP: Gerrit: Add support for scap [puppet] - 10https://gerrit.wikimedia.org/r/363726 [00:35:53] sorry for a lot of spam. [00:40:16] (03PS9) 10Paladox: WIP: Gerrit: Add support for scap [puppet] - 10https://gerrit.wikimedia.org/r/363726 [00:41:11] (03PS5) 10Dzahn: icinga/role:mail::mx: add monitoring of exim queue size [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) [00:41:12] adds more [00:43:19] (03CR) 10Dzahn: "PS5: dropped "wc" entirely, we can use "exipick -bpc" to count. | added "set -euo pipefail" as response to godog's comment." [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) (owner: 10Dzahn) [00:44:26] (03CR) 10Dzahn: "though.. now that " | wc " is gone there is no pipe, so no real reason for "pipefail" either..
shrug" [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) (owner: 10Dzahn) [01:02:07] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1499389324 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 9195778 keys, up 2 minutes 2 seconds - replication_delay is 1499389324 [01:02:17] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6480 [01:02:17] PROBLEM - Check health of redis instance on 6381 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1499389331 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 9097506 keys, up 2 minutes 9 seconds - replication_delay is 1499389331 [01:02:17] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6481 [01:02:17] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 1499389333 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9192557 keys, up 2 minutes 12 seconds - replication_delay is 1499389333 [01:02:27] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1499389344 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9193379 keys, up 2 minutes 22 seconds - replication_delay is 1499389344 [01:03:07] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 9188738 keys, up 3 minutes 2 seconds - replication_delay is 0 [01:03:17] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 4486963 keys, up 3 minutes 7 seconds - replication_delay is 1 [01:03:17] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 4483846 keys, up 3 minutes 7 seconds - replication_delay is 0 [01:03:17] RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9190551 keys, up 3 minutes 11 seconds - replication_delay is 0 [01:03:18] RECOVERY - Check health of redis instance on 6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 9094321 keys, up 3 minutes 12 seconds - replication_delay is 0 [01:03:27] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9186521 keys, up 3 minutes 21 seconds - replication_delay is 0 [01:17:13] (03PS2) 10Dzahn: grafana: Add legend to dashboard varnish-aggregate-client-status-codes [puppet] - 10https://gerrit.wikimedia.org/r/363111 (owner: 10Krinkle) [01:19:25] (03CR) 10Krinkle: [C: 031] Stop forcing php5 in `mwscript` [puppet] - 10https://gerrit.wikimedia.org/r/358896 (https://phabricator.wikimedia.org/T146285) (owner: 10Chad) [01:19:57] (03CR) 10Dzahn: [C: 032] grafana: Add legend to dashboard varnish-aggregate-client-status-codes [puppet] - 10https://gerrit.wikimedia.org/r/363111 (owner: 10Krinkle) [01:20:33] (03CR) 10Krinkle: [C: 031] "Jobs don't use mwscript afaik. 
We use the standalone mediawiki/services/jobrunner service (PHP-based), which curls to localhost/rpc/RunJob" [puppet] - 10https://gerrit.wikimedia.org/r/358896 (https://phabricator.wikimedia.org/T146285) (owner: 10Chad) [01:48:19] PROBLEM - HHVM rendering on mw2223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:49:17] RECOVERY - HHVM rendering on mw2223 is OK: HTTP OK: HTTP/1.1 200 OK - 80005 bytes in 0.403 second response time [02:59:27] PROBLEM - puppet last run on analytics1069 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:05:58] (03PS6) 10Dzahn: icinga/role:mail::mx: add monitoring of exim queue size [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) [03:10:15] (03CR) 10Dzahn: icinga/role:mail::mx: add monitoring of exim queue size (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) (owner: 10Dzahn) [03:10:50] (03CR) 10Dzahn: icinga/role:mail::mx: add monitoring of exim queue size (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) (owner: 10Dzahn) [03:16:25] (03CR) 10Dzahn: icinga/role:mail::mx: add monitoring of exim queue size (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) (owner: 10Dzahn) [03:27:27] RECOVERY - puppet last run on analytics1069 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [03:32:07] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 637.97 seconds [03:35:08] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 254.94 seconds [04:44:37] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1381.90 Read Requests/Sec=403.60 Write Requests/Sec=1.60 KBytes Read/Sec=50345.60 KBytes_Written/Sec=22.00 [04:51:37] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=5.90 Read Requests/Sec=0.40 Write Requests/Sec=0.70 KBytes Read/Sec=2.00 KBytes_Written/Sec=6.80 [05:11:02] 10Operations, 10ArchCom-RfC, 10Traffic, 10Services (designing): Make API usage limits easier to understand, implement, and more adaptive to varying request costs / concurrency limiting - https://phabricator.wikimedia.org/T167906#3414481 (10MZMcBride) This task feels very "we should build a gun today and we... [05:31:07] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [05:31:07] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [05:36:07] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [05:39:07] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:13:55] 10Operations, 10DBA, 10Performance-Team, 10Traffic, 10Wikidata: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3414623 (10jcrespo) Probably related: T169884 [06:49:46] !log rebooting bast3002 for kernel update [06:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:07] PROBLEM - puppet last run on mw2118 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[tzdata] [07:09:10] 10Operations, 10ops-esams: bast3002 didn't come up after reboot - https://phabricator.wikimedia.org/T169959#3414712 (10MoritzMuehlenhoff) [07:11:57] ACKNOWLEDGEMENT - Host bast3002 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T169959 [07:18:27] RECOVERY - puppet last run on mw2118 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [07:33:28] (03PS2) 10Marostegui: Revert "db-codfw.php: Depool db2056" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363641 [07:34:49] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2056" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363641 (owner: 10Marostegui) [07:35:03] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363778 [07:36:17] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2056" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363641 (owner: 10Marostegui) [07:36:26] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2056" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363641 (owner: 10Marostegui) [07:37:03] !log marostegui@tin scap failed: average error rate on 3/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details) [07:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:13] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363778 [07:39:02] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2056 - T169510 (duration: 00m 43s) [07:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:11] T169510: Setup dbstore2002 with 2 new mysql instances from production and enable GTID - https://phabricator.wikimedia.org/T169510 [07:39:12] 10Operations, 10ops-esams: bast3002 didn't come up after reboot - https://phabricator.wikimedia.org/T169959#3414712 (10Volans) @MoritzMuehlenhoff the broken disk was known: T169959 IIRC something similar already happened (cannot remember if for this very host or lvs3001) and Faidon was able to make it boot ag... [07:40:19] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363778 (owner: 10Marostegui) [07:41:13] 10Operations, 10DBA, 10Performance-Team, 10Traffic, 10Wikidata: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3414762 (10jcrespo) > I also wonder why some of those log warnings come from close() and others have the proper commitM... 
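The exim queue-size check iterated on above (r/361023; "dropped wc entirely, we can use exipick -bpc to count") could look roughly like the sketch below. This is a minimal illustration, not the actual patch: the thresholds, variable names, and output format are invented, and only exipick -bpc comes from the review comments. It also shows why the follow-up comment is right that pipefail buys nothing once the pipe is gone.

    #!/bin/bash
    # Minimal sketch of an NRPE-style exim queue-size check (assumes exipick
    # from exim4-base is installed; thresholds are illustrative).
    set -eu   # no pipeline remains, so "set -o pipefail" would be a no-op
    warn=1000
    crit=3000
    queue=$(exipick -bpc)   # prints a bare count of queued messages
    if [ "$queue" -ge "$crit" ]; then
        echo "CRITICAL: exim queue size is $queue"; exit 2
    elif [ "$queue" -ge "$warn" ]; then
        echo "WARNING: exim queue size is $queue"; exit 1
    fi
    echo "OK: exim queue size is $queue"; exit 0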
[07:41:37] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0] [07:41:40] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363778 (owner: 10Marostegui) [07:41:52] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363778 (owner: 10Marostegui) [07:41:57] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [600.0] [07:42:50] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1083 - T166204 (duration: 00m 42s) [07:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:00] T166204: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204 [07:45:27] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] [07:53:38] 10Operations: postgresql::ganglia on puppetdb servers - authentication failed - https://phabricator.wikimedia.org/T169953#3414783 (10akosiaris) > Do we still need this Ganglia plugin or should we simply remove it since Ganglia is deprecated? We should remove it. > Do you know if it has worked before and someho... [07:55:53] !log installing libgcrypt security updates [07:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:22] anybody checking the mw exceptions? [07:56:26] just saw the alert [07:56:29] elukey: looking [07:57:01] thanks dcausse ! [08:00:26] dcausse: let me know if you need any help, seems ES related from the stacktrace but I might be wrong [08:00:40] seems related to cirrus, at least I see tons of errors since yesterday 9pm utc [08:05:27] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [08:11:07] elukey: huge load spike on many nodes, very similar to what we've seen earlier this week (see T169498) [08:11:08] T169498: Investigate load spikes on the elasticsearch cluster in eqiad - https://phabricator.wikimedia.org/T169498 [08:11:23] it seems to be recovering now [08:11:57] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1001 is OK: OK: Less than 20.00% above the threshold [300.0] [08:12:27] okok [08:12:35] Cc: gehel [08:14:39] (03PS1) 10Alexandros Kosiaris: Add temporary role::ores::stresstest [puppet] - 10https://gerrit.wikimedia.org/r/363780 (https://phabricator.wikimedia.org/T169246) [08:14:47] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [08:14:57] PROBLEM - Check systemd state on bast3002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:16:34] jynus / marostegui can we do T167031 ? 
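The repools above follow the usual mediawiki-config deployment pattern: revert the depool commit, merge it, then push just that file from the deployment host. A rough sketch, assuming the standard scap sync-file invocation; the file path and log message are taken from the entries above, the rest is inferred:

    # On the deployment host (tin), once the revert is merged:
    cd /srv/mediawiki-staging
    git pull    # pick up the merged mediawiki-config change
    scap sync-file wmf-config/db-eqiad.php 'Repool db1083 - T166204'
    # If the canary error-rate check trips spuriously, as at 07:37:03 above,
    # the sync can be rerun with --force, per the error message itself.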
[08:16:34] T167031: Global rename of Idh0854 → Garam: supervision needed - https://phabricator.wikimedia.org/T167031 [08:16:44] TabbyCat: give me a sec [08:17:01] claro [08:20:11] 10Operations, 10DBA, 10Wikimedia-Site-requests, 10User-MarcoAurelio: Global rename of Idh0854 → Garam: supervision needed - https://phabricator.wikimedia.org/T167031#3414823 (10MarcoAurelio) 05stalled>03Open a:03MarcoAurelio [08:20:47] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [08:21:00] TabbyCat: can you send me the meta link for the progress? so I can keep it open? [08:21:07] sure [08:21:17] PROBLEM - HHVM rendering on mw2215 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:21:20] but I have to trigger the rename first marostegui [08:22:00] when you give me the okay we'll start [08:22:05] 10Operations, 10ops-esams: bast3002 didn't come up after reboot - https://phabricator.wikimedia.org/T169959#3414827 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi Host is back with "disable system services" set to YES in the idrac configuration, https://wikitech.wikimedia.org/wiki/Platform-specific_docu... [08:22:07] RECOVERY - HHVM rendering on mw2215 is OK: HTTP OK: HTTP/1.1 200 OK - 79923 bytes in 0.289 second response time [08:22:10] TabbyCat: I just want to double check if most of the wikis are on kowiki :) [08:22:16] as specified on the task [08:23:11] you mean edits? [08:23:17] !log banning elastic1020 and elastic1026 from elasticsearch eqiad cluster [08:23:19] sorry yes [08:23:22] (03PS1) 10Alexandros Kosiaris: Add role::ores::stresstest hieradata [labs/private] - 10https://gerrit.wikimedia.org/r/363782 [08:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:10] marostegui: https://meta.wikimedia.org/w/index.php?title=Special:CentralAuth&target=Idh0854 [08:24:16] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add role::ores::stresstest hieradata [labs/private] - 10https://gerrit.wikimedia.org/r/363782 (owner: 10Alexandros Kosiaris) [08:25:05] TabbyCat: gracias, that is what I wanted. You can go ahead if you like! [08:25:08] I am ready [08:25:29] okay, give me a sec [08:25:44] TabbyCat: would you !log that or you want me to? [08:26:06] (03PS2) 10D3r1ck01: Remove 'din' from wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362876 (https://phabricator.wikimedia.org/T168523) [08:26:08] I can do that when I start :) [08:26:18] perfecto! :) [08:26:47] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [08:27:57] !log Starting global rename of Idh0854 → Garam (T167031) [08:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:07] T167031: Global rename of Idh0854 → Garam: supervision needed - https://phabricator.wikimedia.org/T167031 [08:29:31] TabbyCat: Send me the meta url once you've got it, so I can check the progress too and check the different shards :-) [08:29:37] marostegui: I got to usurp Garam first if the target consented [08:29:44] gimme a min [08:29:47] sure [08:31:47] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [08:32:14] marostegui: in progress now https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/Garam [08:32:31] !log installing expat security updates [08:32:41] TabbyCat: gracias!
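The "banning" of elastic1020 and elastic1026 just logged is, in standard Elasticsearch terms, shard-allocation filtering. The exact tooling used isn't shown in the log, so the following is a generic sketch against the cluster settings API; the node names come from the log, the host and everything else are assumptions:

    # Exclude the two nodes so shards drain off them:
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "transient": {
        "cluster.routing.allocation.exclude._name": "elastic1020,elastic1026"
      }
    }'
    # The later unbanning (09:42:58 below) amounts to clearing the exclusion:
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "transient": { "cluster.routing.allocation.exclude._name": "" }
    }'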
[08:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:37] Sorry to disturb, but I have a small request for assistance. A user is having difficulty logging in over at #wikimedia-tech and has tried two different devices on different networks, incognito mode, and different browsers, and isn't globally locked or IP blocked. [08:35:39] (03PS2) 10Giuseppe Lavagetto: role::puppetmaster::common: add environments support [puppet] - 10https://gerrit.wikimedia.org/r/362985 [08:39:47] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [08:39:57] <_joe_> !log disabling puppet across the fleet for enabling directory environments in puppet [08:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:02] (03CR) 10Giuseppe Lavagetto: [C: 032] role::puppetmaster::common: add environments support [puppet] - 10https://gerrit.wikimedia.org/r/362985 (owner: 10Giuseppe Lavagetto) [08:43:16] marostegui TabbyCat: ^ (if you're busy this second, very happy to hang for a bit) [08:43:51] marostegui: you decide, I already did my part :) [08:44:27] TabbyCat: I still see lots of wikis queued [08:44:27] (03PS1) 10Giuseppe Lavagetto: puppetmaster::gitclone: link environments to /etc/puppet [puppet] - 10https://gerrit.wikimedia.org/r/363784 [08:44:45] marostegui: I mean TheDragonFire request above [08:44:56] I cannot decide on what he said [08:45:18] oh wait [08:45:26] But, how is that blocking us? [08:45:31] I am a bit lost :) [08:45:36] so I was [08:45:38] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster::gitclone: link environments to /etc/puppet [puppet] - 10https://gerrit.wikimedia.org/r/363784 (owner: 10Giuseppe Lavagetto) [08:45:47] I thought it was that role::puppetmaster [08:45:51] lol [08:45:58] I need another cup of coffee [08:46:27] TheDragonFire: I'd say to ask on #wikimedia-ops [08:47:05] 10Operations, 10Beta-Cluster-Infrastructure, 10Performance-Team, 10Thumbor, 10Patch-For-Review: Beta thumbnails are broken - https://phabricator.wikimedia.org/T169114#3414879 (10fgiunchedi) 05Open>03Resolved Looks like this is fixed, we don't have poolcounter in beta I think? Anyways if we do we can... [08:48:25] jfsamper: #wikimedia-ops is for IRC ops is it not? [08:48:33] TabbyCat* [08:48:48] yes [08:49:01] maybe he's triggering the login limit the channel has set [08:49:10] an operator can /invite him/her there [08:50:29] TabbyCat: The user is having problems logging onto Wikipedia, not IRC. But someone's just suggested trying login.wikimedia.org so we'll see how that goes. [08:50:39] ah ah [08:50:44] hmm [08:50:51] right [08:51:02] which error message does he receive? [08:51:26] I'm on -tech, will follow-up there [08:52:30] <_joe_> !log restarting apache on all puppetmaster, after a successful puppet run [08:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:04] <_joe_> !log reenabling puppet across the fleet [08:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:48] !log restarting HHVM on app server canaries to pick up libgcrypt and expat updates [08:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:37] PROBLEM - Check systemd state on ores1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
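For context on the directory-environments rollout above (disable puppet fleet-wide, merge r/362985 and r/363784, restart the puppetmasters, re-enable): once the master serves environments from a directory tree, an agent can be pointed at a non-default environment per run. A sketch with assumed paths and names, not the exact Wikimedia config; the "future" environment for the future parser shows up later in the log (09:42:09):

    # Master side, puppet.conf gains roughly:
    #   [main]
    #   environmentpath = /etc/puppet/environments
    # so that /etc/puppet/environments/production is the default tree and
    # e.g. /etc/puppet/environments/future is an alternate one.
    #
    # Agent side, a single test run against a non-default environment:
    puppet agent --test --environment=future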
[08:59:37] 10Operations, 10DBA: Evaluate how hard would be to get aa(wikibooks|wiktionary) and howiki databases deleted - https://phabricator.wikimedia.org/T169928#3413499 (10Marostegui) Deleting it just from the database can create inconsistencies. I wouldn't feel too comfortable just issuing a drop database in producti... [09:01:37] RECOVERY - Check systemd state on ores1001 is OK: OK - running: The system is fully operational [09:02:56] (03PS3) 10Filippo Giunchedi: Deployment-Prep: Set correct restbase_uri for Change Propagation [puppet] - 10https://gerrit.wikimedia.org/r/363638 (https://phabricator.wikimedia.org/T169912) (owner: 10Ppchelko) [09:08:02] (03CR) 10Filippo Giunchedi: icinga/role:mail::mx: add monitoring of exim queue size (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) (owner: 10Dzahn) [09:08:25] (03CR) 10Filippo Giunchedi: [C: 032] Deployment-Prep: Set correct restbase_uri for Change Propagation [puppet] - 10https://gerrit.wikimedia.org/r/363638 (https://phabricator.wikimedia.org/T169912) (owner: 10Ppchelko) [09:11:32] (03PS1) 10Filippo Giunchedi: puppetmaster: deactivate node in wmf-reimage [puppet] - 10https://gerrit.wikimedia.org/r/363789 [09:11:41] 10Operations, 10Scoring-platform-team: rack/setup/install ores1001-1009 - https://phabricator.wikimedia.org/T165171#3414940 (10akosiaris) [09:11:47] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [09:11:54] 10Operations, 10Scoring-platform-team: rack/setup/install ores1001-1009 - https://phabricator.wikimedia.org/T165171#3258960 (10akosiaris) [09:13:32] 10Operations, 10Scoring-platform-team: rack/setup/install ores1001-1009 - https://phabricator.wikimedia.org/T165171#3258960 (10akosiaris) 05Open>03Resolved Per @Ladsgroup 's comment we better handle the service implementation in T168073. Which is btw gonna be stalled as we are going to stress test a bit th... [09:15:58] (03PS2) 10Alexandros Kosiaris: Add temporary role::ores::stresstest [puppet] - 10https://gerrit.wikimedia.org/r/363780 (https://phabricator.wikimedia.org/T169246) [09:16:10] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add temporary role::ores::stresstest [puppet] - 10https://gerrit.wikimedia.org/r/363780 (https://phabricator.wikimedia.org/T169246) (owner: 10Alexandros Kosiaris) [09:16:20] _joe_: https://gerrit.wikimedia.org/r/363789 [09:16:47] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [09:18:32] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster: deactivate node in wmf-reimage [puppet] - 10https://gerrit.wikimedia.org/r/363789 (owner: 10Filippo Giunchedi) [09:22:17] PROBLEM - DPKG on ores1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:22:17] PROBLEM - puppet last run on ores1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:22:27] PROBLEM - Check systemd state on ores1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:22:27] those are ok ^ [09:22:48] PROBLEM - DPKG on ores1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:22:57] PROBLEM - Disk space on ores1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:22:57] PROBLEM - MD RAID on ores1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:22:59] TabbyCat: almost there [09:23:07] PROBLEM - DPKG on ores1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:23:07] PROBLEM - dhclient process on ores1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:23:07] PROBLEM - salt-minion processes on ores1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:23:17] PROBLEM - configured eth on ores1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:23:17] RECOVERY - Check systemd state on ores1003 is OK: OK - running: The system is fully operational [09:23:17] PROBLEM - configured eth on ores1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:23:17] RECOVERY - puppet last run on ores1003 is OK: OK: Puppet is currently enabled, last run 29 minutes ago with 0 failures [09:23:17] RECOVERY - DPKG on ores1003 is OK: All packages OK [09:23:17] PROBLEM - puppet last run on ores1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:23:24] marostegui: yep, after kowiki was done it started to go faster [09:23:31] !log schedule a month's worth of downtime for ores100X [09:23:32] I looked 10 minutes ago [09:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:47] RECOVERY - Disk space on ores1003 is OK: DISK OK [09:23:48] RECOVERY - MD RAID on ores1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [09:23:57] RECOVERY - DPKG on ores1001 is OK: All packages OK [09:23:57] RECOVERY - dhclient process on ores1001 is OK: PROCS OK: 0 processes with command name dhclient [09:23:57] RECOVERY - salt-minion processes on ores1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:24:03] (03PS2) 10Filippo Giunchedi: puppetmaster: deactivate node in wmf-reimage [puppet] - 10https://gerrit.wikimedia.org/r/363789 [09:24:07] RECOVERY - configured eth on ores1003 is OK: OK - interfaces up [09:24:07] RECOVERY - configured eth on ores1001 is OK: OK - interfaces up [09:24:07] RECOVERY - puppet last run on ores1001 is OK: OK: Puppet is currently enabled, last run 18 minutes ago with 0 failures [09:24:11] we're on "u" [09:24:47] RECOVERY - DPKG on ores1004 is OK: All packages OK [09:24:59] !log installing NTP security updates on trusty hosts [09:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:03] TabbyCat: finished! [09:29:26] marostegui: good [09:30:14] !log Global rename of Idh0854 → Garam has finished (T167031) [09:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:26] T167031: Global rename of Idh0854 → Garam: supervision needed - https://phabricator.wikimedia.org/T167031 [09:31:39] 10Operations, 10DBA, 10Wikimedia-Site-requests, 10User-MarcoAurelio: Global rename of Idh0854 → Garam: supervision needed - https://phabricator.wikimedia.org/T167031#3414996 (10MarcoAurelio) 05Open>03Resolved Thanks to @marostegui for his help. 
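Scheduling "a month's worth of downtime for ores100X" (09:23:31 above) is typically done through Icinga's external command file rather than by clicking through the UI. A sketch; the command-file path, author, comment, and host list are assumptions:

    now=$(date +%s)
    end=$((now + 30*24*3600))   # one month, matching the !log entry
    for host in ores1001 ores1002 ores1003 ores1004; do
      # [time] SCHEDULE_HOST_DOWNTIME;host;start;end;fixed;trigger_id;duration;author;comment
      printf '[%d] SCHEDULE_HOST_DOWNTIME;%s;%d;%d;1;0;0;akosiaris;ores stress test\n' \
        "$now" "$host" "$now" "$end" >> /var/lib/icinga/rw/icinga.cmd
    done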
[09:37:17] !log restarting elastic1036 (corrupted statistics) [09:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:11] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 3.39% of data above the critical threshold [1000.0] [09:42:09] (03PS1) 10Giuseppe Lavagetto: puppetmaster: add environment for the future parser [puppet] - 10https://gerrit.wikimedia.org/r/363790 (https://phabricator.wikimedia.org/T169485) [09:42:14] that's ores btw ^ the too many creates [09:42:58] !log unbanning elastic1020 and 1026 from elasticsearch eqiad [09:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:53] godog: yeah bringing the stresstesting cluster online [09:43:58] I am done actually [09:45:13] akosiaris: *nod* I'm opening a task to regularly purge ores metrics, almost half is a month old [09:45:29] (03PS1) 10Elukey: redis::monitoring::nrpe_instance: set retry_interval to 60s [puppet] - 10https://gerrit.wikimedia.org/r/363791 [09:45:35] (03CR) 10Alexandros Kosiaris: [C: 031] "nice!" [puppet] - 10https://gerrit.wikimedia.org/r/363790 (https://phabricator.wikimedia.org/T169485) (owner: 10Giuseppe Lavagetto) [09:46:31] 10Operations, 10Graphite, 10ORES, 10Scoring-platform-team-Backlog, 10User-fgiunchedi: Regularly purge old ores graphite metrics - https://phabricator.wikimedia.org/T169969#3415026 (10fgiunchedi) [09:47:39] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster: add environment for the future parser [puppet] - 10https://gerrit.wikimedia.org/r/363790 (https://phabricator.wikimedia.org/T169485) (owner: 10Giuseppe Lavagetto) [09:50:21] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [09:50:31] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [09:53:21] PROBLEM - puppet last run on mw1261 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:54:21] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [09:54:27] (03CR) 10Alexandros Kosiaris: [C: 04-1] redis::monitoring::nrpe_instance: set retry_interval to 60s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363791 (owner: 10Elukey) [09:59:21] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:59:21] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:59:31] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:59:40] 10Operations, 10DBA, 10Scoring-platform-team, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3415085 (10jcrespo) Someone announced 60 seconds of downtime, which I do not think is reasonable- rebooting fully a server and all its services takes around 3... [10:00:21] RECOVERY - puppet last run on mw1261 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [10:02:02] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0 [10:08:51] PROBLEM - NTP on db1069 is CRITICAL: NTP CRITICAL: Offset unknown [10:09:10] (03CR) 10Gehel: [C: 031] "LGTM (and so much nicer to be able to reuse this QueryBuilder)." 
[software/cumin] - 10https://gerrit.wikimedia.org/r/363750 (owner: 10Volans) [10:09:11] PROBLEM - NTP on db1026 is CRITICAL: NTP CRITICAL: Offset unknown [10:09:17] 10Operations, 10Graphite, 10User-fgiunchedi: Delete "servers" metrics in graphite older than 60d - https://phabricator.wikimedia.org/T169972#3415090 (10fgiunchedi) [10:12:01] PROBLEM - nutcracker process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:12:11] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:13:31] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:13:51] RECOVERY - nutcracker process on thumbor1001 is OK: PROCS OK: 1 process with UID = 115 (nutcracker), command name nutcracker [10:14:01] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:14:21] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient [10:15:49] (03PS2) 10Elukey: redis::monitoring::nrpe_instance: set retry_interval to 2 mins [puppet] - 10https://gerrit.wikimedia.org/r/363791 [10:16:27] <_joe_> uh what's up on thumbor? [10:24:18] (03CR) 10Gehel: Configuration: automatically load backend's aliases (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/363747 (https://phabricator.wikimedia.org/T169640) (owner: 10Volans) [10:25:25] godog: I purged it a few times already [10:25:52] 10Operations, 10Graphite, 10User-fgiunchedi: Something puts many different metrics into graphite, allocating a lot of disk space - https://phabricator.wikimedia.org/T1075#3415115 (10fgiunchedi) [10:27:41] i.e. action=purge on the main web page, then trying to bypass varnish by adding http parameters and X-wikimedia-debug [10:29:18] also can you check https://phabricator.wikimedia.org/T168002#3377446? the file is deleted ages ago [10:30:48] !log restarting elastic1043 (corrupted statistics) [10:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:11] zhuyifei1999_: not today sorry, they'll eventually fall out of varnish though if they are gone from swift [10:36:30] um okay [10:36:36] can I have an eta? 
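The ETA question gets answered just below (cache TTL of roughly 7 days). In the meantime, how long a given object has been sitting in cache can be read straight from the response headers; a sketch, with an illustrative URL:

    curl -sI 'https://upload.wikimedia.org/wikipedia/commons/x/xx/Example.webm' \
      | grep -iE '^(HTTP|Age|X-Cache)'
    # Age counts seconds spent in cache; once it exceeds the object's TTL
    # (~7 days here) the cached copy expires and a file deleted from swift
    # stops being served.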
[10:38:51] RECOVERY - NTP on db1069 is OK: NTP OK: Offset -5.36441803e-05 secs [10:39:11] RECOVERY - NTP on db1026 is OK: NTP OK: Offset -9.620189667e-05 secs [10:45:13] (03PS1) 10Giuseppe Lavagetto: profile::base: allow setting a puppet environment [puppet] - 10https://gerrit.wikimedia.org/r/363797 (https://phabricator.wikimedia.org/T169485) [10:46:33] (03CR) 10jerkins-bot: [V: 04-1] profile::base: allow setting a puppet environment [puppet] - 10https://gerrit.wikimedia.org/r/363797 (https://phabricator.wikimedia.org/T169485) (owner: 10Giuseppe Lavagetto) [10:48:42] (03PS2) 10Giuseppe Lavagetto: profile::base: allow setting a puppet environment [puppet] - 10https://gerrit.wikimedia.org/r/363797 (https://phabricator.wikimedia.org/T169485) [10:53:13] (03PS3) 10Giuseppe Lavagetto: profile::base: allow setting a puppet environment [puppet] - 10https://gerrit.wikimedia.org/r/363797 (https://phabricator.wikimedia.org/T169485) [10:55:59] zhuyifei1999_: TTL of varnish is ~7d IIRC [10:56:53] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::base: allow setting a puppet environment [puppet] - 10https://gerrit.wikimedia.org/r/363797 (https://phabricator.wikimedia.org/T169485) (owner: 10Giuseppe Lavagetto) [10:57:03] https://commons.wikimedia.org/wiki/File:Dfsdfsdfsdfsdf.webm is deleted on June 17 [11:00:16] zhuyifei1999_: ok that file was still in swift, I've deleted it [11:00:24] k [11:00:28] thx [11:01:13] zhuyifei1999_: looks like a bug on mw side though, I suggest you add some mediawiki projects too [11:19:54] (03PS2) 10Muehlenhoff: Remove expiry date for jdcc [puppet] - 10https://gerrit.wikimedia.org/r/363535 [11:23:45] (03CR) 10Muehlenhoff: [C: 032] Remove expiry date for jdcc [puppet] - 10https://gerrit.wikimedia.org/r/363535 (owner: 10Muehlenhoff) [11:28:51] akosiaris: I'd try https://gerrit.wikimedia.org/r/#/c/363791, seems safe enough [11:40:56] !log rebooting rdb* servers in codfw for kernel update [11:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:21] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [12:20:11] PROBLEM - MariaDB Slave SQL: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [12:20:12] PROBLEM - MariaDB Slave IO: m3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [12:20:12] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [12:20:14] !log restart mysql on dbstore1002 - high swap used [12:20:21] PROBLEM - MariaDB Slave IO: s6 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [12:20:22] PROBLEM - MariaDB Slave IO: s2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [12:20:22] PROBLEM - MariaDB Slave SQL: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [12:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:36] aaand it wasn't downtimed anymore [12:20:41] good job Luca [12:20:41] PROBLEM - MariaDB Slave IO: s7 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [12:20:41] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [12:20:41] PROBLEM - MariaDB Slave IO: s1 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [12:20:41] PROBLEM - MariaDB Slave IO: s5 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [12:20:41] PROBLEM - MariaDB Slave SQL: s2 on dbstore1002 is 
CRITICAL: CRITICAL slave_sql_state could not connect [12:20:41] PROBLEM - MariaDB Slave IO: x1 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [12:20:41] PROBLEM - MariaDB Slave SQL: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [12:20:42] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [12:20:42] PROBLEM - MariaDB Slave IO: s4 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [12:20:43] PROBLEM - MariaDB Slave IO: s3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [12:20:51] PROBLEM - MariaDB Slave SQL: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [12:20:52] PROBLEM - MariaDB Slave SQL: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [12:22:21] PROBLEM - MariaDB Slave SQL: m2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [12:22:51] PROBLEM - MariaDB Slave IO: m2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [12:23:51] RECOVERY - MariaDB Slave IO: m2 on dbstore1002 is OK: OK slave_io_state not a slave [12:24:22] RECOVERY - MariaDB Slave SQL: m2 on dbstore1002 is OK: OK slave_sql_state not a slave [12:27:11] RECOVERY - MariaDB Slave IO: m3 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [12:27:11] RECOVERY - MariaDB Slave SQL: s1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:27:12] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:27:21] sorry for the noise [12:27:21] RECOVERY - MariaDB Slave IO: s6 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [12:27:21] RECOVERY - MariaDB Slave IO: s2 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [12:27:22] RECOVERY - MariaDB Slave SQL: s4 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:27:41] RECOVERY - MariaDB Slave IO: s7 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [12:27:41] RECOVERY - MariaDB Slave IO: s1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [12:27:41] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:27:41] RECOVERY - MariaDB Slave IO: s5 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [12:27:41] RECOVERY - MariaDB Slave SQL: s2 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:27:41] RECOVERY - MariaDB Slave SQL: s6 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:27:41] RECOVERY - MariaDB Slave IO: x1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [12:27:42] RECOVERY - MariaDB Slave SQL: s5 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:27:42] RECOVERY - MariaDB Slave IO: s4 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [12:27:43] RECOVERY - MariaDB Slave IO: s3 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [12:27:51] RECOVERY - MariaDB Slave SQL: s7 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:27:51] RECOVERY - MariaDB Slave SQL: m3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:42:42] (03PS3) 10Giuseppe Lavagetto: Rationalize and centralize directory references [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363216 [12:42:44] (03PS4) 10Giuseppe Lavagetto: Generalize state management, allow multiple run modes [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363217 [12:42:46] (03PS3) 
10Giuseppe Lavagetto: Add coverage report [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363350 [12:42:48] (03PS3) 10Giuseppe Lavagetto: Raise test coverage percentage [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363351 [12:42:50] (03PS1) 10Giuseppe Lavagetto: [WiP] Add future parser run mode [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363808 [12:43:33] (03CR) 10jerkins-bot: [V: 04-1] [WiP] Add future parser run mode [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363808 (owner: 10Giuseppe Lavagetto) [12:46:46] Hello guys how's everybody [12:47:05] I come from an ISP, currently unable to reach wikipedia.org [12:47:22] Some users are complaining about this issue. Is there anyone around I can check the problem with? THX [12:47:48] XioNoX: --^ [12:47:58] imadoz: Hello again - are you able to perform a traceroute from the network? [12:48:40] Sure [12:49:03] it would also be good to have an address to traceroute back from our network, we could open a phabricator task and set it confidential [12:49:03] I have actually configured a forwarder for all DNS queries destined for wikipedia.org [12:50:08] imadoz: is wikipedia.org resolving at all? [12:50:32] Of course, should I paste the trace here? [12:50:52] imadoz: a pastebin would be preferred [12:51:43] What's the ISP/country? [12:52:17] XioNoX: Lebanon, not sure on ISP [12:52:43] yeah, we got a report yesterday [12:52:58] Oh, routing? :/ [12:53:21] mnets.net? [12:53:50] imadoz: ^? [12:54:19] so far it seems like there is a middleman blocking traffic to Wikipedia's European datacenter from that provider [12:54:45] dropping http sessions, and not letting https establish [12:54:50] https://pastebin.com/embed_js/eZL8302R [12:55:18] imadoz: which ISP are you from? [12:55:38] ISP is Broadband Plus from Lebanon [12:55:49] You can use the IP address 62.84.80.202 to trace back to us [12:55:58] It is my current NAT IP [12:57:44] imadoz: https://www.ripe.net/membership/indices/data/lb.broadbandplus.html ? Should cedarcom.net serve anything? I get redirected to http://www.mobi.net.lb but it times out [12:58:04] Broadband Plus /Cedarcom /Mobi same company [12:59:01] We are doing some maintenance on our website, and it is currently down [12:59:16] It will be back up in minutes, but this is not related [13:00:11] <_joe_> imadoz: do you know if other providers in Lebanon are having the same issue? [13:00:23] Did not check with other providers [13:00:45] We have a DDOS mitigation service running 24/7 as well, therefore all our traffic is tunneled towards another provider in RUSSIA [13:01:26] I am currently in the process of advertising one /24 subnet directly to our TELCO and checking the issue if it gets resolved [13:02:44] let's start with the output of curl -v https://en.wikipedia.org/wiki/Main_Page and "telnet 91.198.174.192 443" [13:02:45] Then: can you edit /etc/hosts and add the line "208.80.153.224 en.wikipedia.org" (without the quotes) and then wait a few minutes and share "curl -v https://en.wikipedia.org/wiki/Main_Page" [13:02:48] imadoz: ^ [13:03:12] to make sure it's the same symptoms [13:05:18] (03PS1) 10Marostegui: db-eqiad.php: db1079 as sanitarium3 master for s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363815 (https://phabricator.wikimedia.org/T153743) [13:10:54] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "fwiw, we do use the slaveof command extensively when doing switchovers."
[puppet] - 10https://gerrit.wikimedia.org/r/251800 (owner: 10Ori.livneh) [13:13:56] imadoz: please do let us know if there's anything that doesn't quite make sense :) it appeared from the above pastebin you may be using a Windows system, and might not have access to the curl command. If you have access to powershell curl is an alias for a similar command and the above syntax will work [13:15:24] Sorry for the delay guys [13:16:24] Not a problem :-) [13:16:32] We are checking with our DDOS mitigation provider if the problem is due to a middleman blocking traffic like you mentioned [13:17:12] While the subnet was advertised towards our TELCO directly, the website worked fine. Therefore, I believe no problem exists from our side or yours [13:17:15] I will keep you posted [13:18:10] Good to hear! [13:27:27] (03PS1) 10Giuseppe Lavagetto: Force integer pool size [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363826 [13:27:53] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Force integer pool size [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363826 (owner: 10Giuseppe Lavagetto) [13:28:31] (03Merged) 10jenkins-bot: Force integer pool size [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363826 (owner: 10Giuseppe Lavagetto) [13:29:26] (03CR) 10Giuseppe Lavagetto: [C: 032] Rationalize and centralize directory references [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363216 (owner: 10Giuseppe Lavagetto) [13:30:37] (03PS1) 10Marostegui: s2.hosts: Add dbstore2002 port 3312 [software] - 10https://gerrit.wikimedia.org/r/363827 (https://phabricator.wikimedia.org/T169510) [13:31:11] (03PS2) 10Jcrespo: Revert "install_server: Change db1098 MAC address to the one that shows link" [puppet] - 10https://gerrit.wikimedia.org/r/363565 [13:31:24] (03PS5) 10Giuseppe Lavagetto: Generalize state management, allow multiple run modes [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363217 [13:31:48] (03CR) 10Alexandros Kosiaris: [C: 032] redis::monitoring::nrpe_instance: set retry_interval to 2 mins [puppet] - 10https://gerrit.wikimedia.org/r/363791 (owner: 10Elukey) [13:33:25] (03CR) 10Marostegui: [C: 032] s2.hosts: Add dbstore2002 port 3312 [software] - 10https://gerrit.wikimedia.org/r/363827 (https://phabricator.wikimedia.org/T169510) (owner: 10Marostegui) [13:34:12] (03Merged) 10jenkins-bot: s2.hosts: Add dbstore2002 port 3312 [software] - 10https://gerrit.wikimedia.org/r/363827 (https://phabricator.wikimedia.org/T169510) (owner: 10Marostegui) [13:45:30] (03CR) 10Jcrespo: [C: 032] Revert "install_server: Change db1098 MAC address to the one that shows link" [puppet] - 10https://gerrit.wikimedia.org/r/363565 (owner: 10Jcrespo) [14:04:51] RECOVERY - Check systemd state on bast3002 is OK: OK - running: The system is fully operational [14:05:49] 10Operations, 10Mail: Increase email log retention period for the main email relays - https://phabricator.wikimedia.org/T167333#3415559 (10herron) 05Open>03Resolved Confirming that the updated logrotate config works. Files have been rotated to .11.gz. -rw-r----- 1 Debian-exim adm 22M Jun 27 06:25 /var/... 
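For reference, XioNoX's debugging steps from 13:02 above, collected into one runnable sequence. Per the log, 91.198.174.192 is the European (esams) frontend the ISP normally reaches and 208.80.153.224 is an alternative datacenter address used to compare paths; editing /etc/hosts needs root, and the extra line should be removed again afterwards:

    # 1) raw TCP reachability to the esams frontend:
    telnet 91.198.174.192 443
    # 2) full TLS handshake plus page fetch, verbose:
    curl -v https://en.wikipedia.org/wiki/Main_Page
    # 3) pin en.wikipedia.org to the other address and repeat, to see
    #    whether the failure follows the esams path:
    echo '208.80.153.224 en.wikipedia.org' | sudo tee -a /etc/hosts
    curl -v https://en.wikipedia.org/wiki/Main_Page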
[14:17:52] (03CR) 10Jcrespo: [C: 032] Update to mariadb 10.1.25, support multi-instance, move unit path [software] - 10https://gerrit.wikimedia.org/r/363327 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [14:18:20] 10Operations, 10Analytics-Kanban, 10DBA, 10Patch-For-Review, 10User-Elukey: Puppetize Piwik's Database and set up periodical backups - https://phabricator.wikimedia.org/T164073#3415609 (10elukey) A lot of things changed from my last post, most of them due to the fact that now the apps are not sending any... [14:18:38] (03PS7) 10Jcrespo: mariadb: Support multiple instances directly on the module [puppet] - 10https://gerrit.wikimedia.org/r/363195 (https://phabricator.wikimedia.org/T169514) [14:22:02] @TheresNoTime @XioNoX @_joe_ Thanks guys, problem was resolved successfully! [14:22:22] No worries, thanks for dropping by :-) [14:22:29] @elukey as well [14:22:32] C you guys [14:22:55] thank you! [14:23:23] (03PS8) 10Jcrespo: mariadb: Support multiple instances directly on the module [puppet] - 10https://gerrit.wikimedia.org/r/363195 (https://phabricator.wikimedia.org/T169514) [14:34:32] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestmetal2001.codfw.wmnet - https://phabricator.wikimedia.org/T168891#3415654 (10Papaul) [14:35:40] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestmetal2001.codfw.wmnet - https://phabricator.wikimedia.org/T168891#3379777 (10Papaul) a:05Papaul>03chasemp @chasemp This is complete , You can take over. Thanks. [14:37:57] (03CR) 10Jcrespo: [C: 032] mariadb: Support multiple instances directly on the module [puppet] - 10https://gerrit.wikimedia.org/r/363195 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [14:39:28] (03CR) 10Jcrespo: [C: 04-1] "Let's test on another host first." [puppet] - 10https://gerrit.wikimedia.org/r/363204 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [14:41:17] PROBLEM - puppet last run on db1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:42:03] PROBLEM - mysqld processes on db1102 is CRITICAL: PROCS CRITICAL: 2 processes with command name mysqld [14:42:42] PROBLEM - mysqld processes on dbstore2002 is CRITICAL: PROCS CRITICAL: 2 processes with command name mysqld [14:42:55] PROBLEM - puppet last run on db1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:42:56] icinga lost downtimes again [14:42:56] :( [14:43:00] :-( [14:43:15] PROBLEM - Check systemd state on labstore2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:43:27] icinga? 
[14:43:31] ah, as expected [14:43:38] PROBLEM - kartotherian endpoints health on maps-test2004 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (expecting: [14:43:38] }/{y}@{scale}x.{format} (default scaled tile) is CRITICAL: Test default scaled tile returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-pbf) is CRITICAL: Test tile service info for osm-pbf returned the unexpected status 400 (expectin [14:43:38] PROBLEM - nutcracker process on mw1228 is CRITICAL: Return code of 255 is out of bounds [14:43:45] PROBLEM - Disk space on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [14:43:48] PROBLEM - Apache HTTP on mw1228 is CRITICAL: connect to address 10.64.48.63 and port 80: Connection refused [14:43:58] PROBLEM - puppet last run on mw1228 is CRITICAL: Return code of 255 is out of bounds [14:43:58] PROBLEM - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: host 208.80.153.196, interfaces up: 35, down: 1, dormant: 0, excluded: 0, unused: 0BRge-0/0/2: down - Cust: Airport Express WiFi APBR [14:44:07] is the puppet part me? [14:44:18] PROBLEM - Check size of conntrack table on mw1228 is CRITICAL: Return code of 255 is out of bounds [14:44:18] PROBLEM - salt-minion processes on mw1228 is CRITICAL: Return code of 255 is out of bounds [14:44:18] PROBLEM - MD RAID on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [14:44:20] (03CR) 10Alexandros Kosiaris: [C: 031] Add coverage report [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363350 (owner: 10Giuseppe Lavagetto) [14:44:35] PROBLEM - Check systemd state on mw1228 is CRITICAL: Return code of 255 is out of bounds [14:44:48] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1228 is CRITICAL: Return code of 255 is out of bounds [14:44:52] yes, the puppet part is me [14:44:55] PROBLEM - salt-minion processes on ms-fe3002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [14:44:56] PROBLEM - Restbase root url on restbase-dev1003 is CRITICAL: connect to address 10.64.48.46 and port 7231: Connection refused [14:45:03] but only on non-jessie hosts, maybe? [14:45:08] PROBLEM - Check whether ferm is active by checking the default input chain on mw1228 is CRITICAL: Return code of 255 is out of bounds [14:45:15] PROBLEM - SSH on restbase-dev1003 is CRITICAL: connect to address 10.64.48.46 and port 22: Connection refused [14:45:28] PROBLEM - DPKG on mw1228 is CRITICAL: Return code of 255 is out of bounds [14:45:35] PROBLEM - cassandra-a CQL 10.64.48.117:9042 on restbase-dev1003 is CRITICAL: connect to address 10.64.48.117 and port 9042: Connection refused [14:45:37] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:45:45] PROBLEM - Disk space on mw1228 is CRITICAL: Return code of 255 is out of bounds [14:45:46] PROBLEM - HHVM processes on mw2148 is CRITICAL: NRPE: Command check_hhvm not defined [14:45:47] PROBLEM - cassandra-a SSL 10.64.48.117:7001 on restbase-dev1003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [14:45:47] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:45:55] PROBLEM - puppet last run on dbstore2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:45:56] PROBLEM - HHVM rendering on mw2148 is CRITICAL: connect to address 10.192.32.36 and port 80: Connection refused [14:45:57] PROBLEM - HHVM processes on mw1228 is CRITICAL: Return code of 255 is out of bounds [14:45:57] PROBLEM - salt-minion processes on ms-fe3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [14:46:05] PROBLEM - puppet last run on db1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:46:05] PROBLEM - cassandra-a service on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [14:46:05] PROBLEM - Check systemd state on restbase2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:46:09] gah, I'll fixup ms-fe3 [14:46:15] (03CR) 10Alexandros Kosiaris: [C: 031] Raise test coverage percentage [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363351 (owner: 10Giuseppe Lavagetto) [14:46:16] PROBLEM - puppet last run on db1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:46:17] PROBLEM - HHVM rendering on mw1228 is CRITICAL: connect to address 10.64.48.63 and port 80: Connection refused [14:46:18] (03PS3) 10Filippo Giunchedi: Recommendation API: Add the beta scap source [puppet] - 10https://gerrit.wikimedia.org/r/360686 (https://phabricator.wikimedia.org/T165760) (owner: 10Mobrovac) [14:46:19] PROBLEM - cassandra-b CQL 10.64.48.118:9042 on restbase-dev1003 is CRITICAL: connect to address 10.64.48.118 and port 9042: Connection refused [14:46:35] PROBLEM - Nginx local proxy to apache on mw2148 is CRITICAL: connect to address 10.192.32.36 and port 443: Connection refused [14:46:35] PROBLEM - Check systemd state on ms-fe3002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:46:35] PROBLEM - puppet last run on dbstore1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:46:45] PROBLEM - cassandra-b SSL 10.64.48.118:7001 on restbase-dev1003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [14:46:46] PROBLEM - Nginx local proxy to apache on mw1228 is CRITICAL: connect to address 10.64.48.63 and port 443: Connection refused [14:46:55] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:46:55] PROBLEM - puppet last run on labtestpuppetmaster2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 19 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[apache2] [14:46:55] PROBLEM - cassandra-b service on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [14:47:05] PROBLEM - puppetmaster https on labtestpuppetmaster2001 is CRITICAL: connect to address 208.80.153.108 and port 8140: Connection refused [14:47:15] PROBLEM - Check size of conntrack table on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [14:47:15] ACKNOWLEDGEMENT - Apache HTTP on mw1228 is CRITICAL: connect to address 10.64.48.63 and port 80: Connection refused Muehlenhoff T168613 [14:47:15] ACKNOWLEDGEMENT - Check size of conntrack table on mw1228 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T168613 [14:47:15] ACKNOWLEDGEMENT - Check systemd state on mw1228 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T168613 [14:47:15] ACKNOWLEDGEMENT - Check the NTP synchronisation status of timesyncd on mw1228 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T168613 [14:47:15] ACKNOWLEDGEMENT - Check whether ferm is active by checking the default input chain on mw1228 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T168613 [14:47:15] ACKNOWLEDGEMENT - DPKG on mw1228 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T168613 [14:47:16] ACKNOWLEDGEMENT - Disk space on mw1228 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T168613 [14:47:16] ACKNOWLEDGEMENT - HHVM processes on mw1228 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T168613 [14:47:17] ACKNOWLEDGEMENT - HHVM rendering on mw1228 is CRITICAL: connect to address 10.64.48.63 and port 80: Connection refused Muehlenhoff T168613 [14:47:17] ACKNOWLEDGEMENT - IPMI Temperature on mw1228 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T168613 [14:47:18] ACKNOWLEDGEMENT - Nginx local proxy to apache on mw1228 is CRITICAL: connect to address 10.64.48.63 and port 443: Connection refused Muehlenhoff T168613 [14:47:35] PROBLEM - Check systemd state on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [14:47:35] PROBLEM - dhclient process on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [14:47:45] PROBLEM - kartotherian endpoints health on maps-test2001 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (expecting: [14:47:45] }/{y}@{scale}x.{format} (default scaled tile) is CRITICAL: Test default scaled tile returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-pbf) is CRITICAL: Test tile service info for osm-pbf returned the unexpected status 400 (expectin [14:47:45] PROBLEM - MD RAID on lvs3001 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 [14:47:46] ACKNOWLEDGEMENT - MD RAID on lvs3001 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T169993 [14:47:47] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 116 connecting: cp3009_v4, cp3009_v6 [14:47:47] PROBLEM - 
mediawiki-installation DSH group on mw2148 is CRITICAL: Host mw2148 is not in mediawiki-installation dsh group [14:47:47] PROBLEM - Check the NTP synchronisation status of timesyncd on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [14:47:47] PROBLEM - puppet last run on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [14:47:50] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T169993#3415683 (10ops-monitoring-bot) [14:48:05] PROBLEM - kartotherian endpoints health on maps-test2002 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (expecting: [14:48:05] }/{y}@{scale}x.{format} (default scaled tile) is CRITICAL: Test default scaled tile returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-pbf) is CRITICAL: Test tile service info for osm-pbf returned the unexpected status 400 (expectin [14:48:05] PROBLEM - Check whether ferm is active by checking the default input chain on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [14:48:06] PROBLEM - restbase endpoints health on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [14:48:15] PROBLEM - HTTPS-eventdonations on eventdonations.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Name or service not known [14:48:15] PROBLEM - Apache HTTP on mw2148 is CRITICAL: connect to address 10.192.32.36 and port 80: Connection refused [14:48:15] PROBLEM - kartotherian endpoints health on maps-test2003 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (expecting: [14:48:15] }/{y}@{scale}x.{format} (default scaled tile) is CRITICAL: Test default scaled tile returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-pbf) is CRITICAL: Test tile service info for osm-pbf returned the unexpected status 400 (expectin [14:48:16] PROBLEM - nutcracker port on mw1228 is CRITICAL: Return code of 255 is out of bounds [14:48:16] PROBLEM - DPKG on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [14:48:16] PROBLEM - salt-minion processes on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [14:48:17] PROBLEM - Restbase root url on restbase2001 is CRITICAL: connect to address 10.192.16.152 and port 7231: Connection refused [14:48:35] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 116 connecting: cp3009_v4, cp3009_v6 [14:48:53] ACKNOWLEDGEMENT - Check size of conntrack table on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T169696 [14:48:53] ACKNOWLEDGEMENT - Check 
systemd state on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T169696 [14:48:53] ACKNOWLEDGEMENT - Check the NTP synchronisation status of timesyncd on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T169696 [14:48:53] ACKNOWLEDGEMENT - Check whether ferm is active by checking the default input chain on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T169696 [14:48:53] ACKNOWLEDGEMENT - DPKG on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T169696 [14:48:53] ACKNOWLEDGEMENT - Disk space on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T169696 [14:48:53] ACKNOWLEDGEMENT - IPMI Temperature on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T169696 [14:48:54] ACKNOWLEDGEMENT - MD RAID on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T169696 [14:48:54] ACKNOWLEDGEMENT - MegaRAID on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T169696 [14:48:55] ACKNOWLEDGEMENT - Restbase root url on restbase-dev1003 is CRITICAL: connect to address 10.64.48.46 and port 7231: Connection refused Muehlenhoff T169696 [14:48:55] ACKNOWLEDGEMENT - SSH on restbase-dev1003 is CRITICAL: connect to address 10.64.48.46 and port 22: Connection refused Muehlenhoff T169696 [14:48:56] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.48.117:9042 on restbase-dev1003 is CRITICAL: connect to address 10.64.48.117 and port 9042: Connection refused Muehlenhoff T169696 [14:49:25] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 116 connecting: cp3009_v4, cp3009_v6 [14:50:05] RECOVERY - salt-minion processes on ms-fe3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:50:05] RECOVERY - salt-minion processes on ms-fe3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:50:14] 10Operations: postgresql::ganglia on puppetdb servers - authentication failed - https://phabricator.wikimedia.org/T169953#3415687 (10Dzahn) a:03Dzahn thanks @akosiaris gotcha! [14:50:16] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 116 connecting: cp3009_v4, cp3009_v6 [14:50:35] RECOVERY - Check systemd state on ms-fe3002 is OK: OK - running: The system is fully operational [14:51:05] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 116 connecting: cp3009_v4, cp3009_v6 [14:51:35] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 116 connecting: cp3009_v4, cp3009_v6 [14:51:47] (03CR) 10Filippo Giunchedi: [C: 032] Recommendation API: Add the beta scap source [puppet] - 10https://gerrit.wikimedia.org/r/360686 (https://phabricator.wikimedia.org/T165760) (owner: 10Mobrovac) [14:52:25] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestservices2003.wikimedia.org - https://phabricator.wikimedia.org/T168893#3379811 (10MoritzMuehlenhoff) The server has been installed with a public IP, but not added to site.pp, so there's currently no ferm rules enabled. Pl... [14:53:14] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestservices2002.wikimedia.org - https://phabricator.wikimedia.org/T168892#3379794 (10MoritzMuehlenhoff) The server has been installed with a public IP, but not added to site.pp, so there's currently no ferm rules enabled. Pl... 
[14:53:15] PROBLEM - puppet last run on db1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:53:31] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestcontrol2003.wikimedia.org - https://phabricator.wikimedia.org/T168894#3379828 (10MoritzMuehlenhoff) The server has been installed with a public IP, but not added to site.pp, so there's currently no ferm rules enabled. Ple... [14:53:38] How is that page possible, the check is silenced [14:54:00] Ah, right, it is an old one [14:55:18] downtimes do not avoid pages [14:55:29] I do not know why people continue thinking they do [14:58:05] PROBLEM - puppet last run on db1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:58:25] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:59:14] 10Operations, 10Beta-Cluster-Infrastructure, 10Performance-Team, 10Thumbor, 10Patch-For-Review: Beta thumbnails are broken - https://phabricator.wikimedia.org/T169114#3387489 (10hashar) >>! In T169114#3414879, @fgiunchedi wrote: > Looks like this is fixed, we don't have poolcounter in beta I think? Anywa... [14:59:15] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:59:55] PROBLEM - puppet last run on pc2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:00:35] PROBLEM - puppet last run on db1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:00:35] PROBLEM - puppet last run on db2010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:00:36] !log deleting commonswiki_file_1499379383 on elastic@eqiad (failed reindex) [15:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:25] PROBLEM - puppet last run on db2011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:01:35] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:01:35] PROBLEM - puppet last run on pc1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:01:55] (03PS2) 10Jcrespo: mariadb: Switch db1102 role from sanitarium3->dbstore_multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/363204 (https://phabricator.wikimedia.org/T169514) [15:01:57] (03PS1) 10Jcrespo: Fix service for hosts with a default package (fwup. f13be9f5a2949f) [puppet] - 10https://gerrit.wikimedia.org/r/363840 (https://phabricator.wikimedia.org/T169514) [15:02:27] PROBLEM - puppet last run on db2030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:02:43] jynus: I was going to upgrade db1102 to 10.1, you want to take it for your tests? [15:02:46] It is not urgent [15:02:48] (03PS2) 10Jcrespo: mariadb: Fix service for hosts with a default package [puppet] - 10https://gerrit.wikimedia.org/r/363840 (https://phabricator.wikimedia.org/T169514) [15:03:05] PROBLEM - puppet last run on pc1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:03:10] I was going to use another host [15:03:15] PROBLEM - puppet last run on pc2005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:03:15] jynus: Ah cool then :) [15:03:28] are you going to upgrade it anyway? [15:03:39] yeah to make it like db1095 [15:03:51] oh [15:03:59] you mean mariadb [15:04:03] sorry, yes :) [15:04:07] I thought you meant the os [15:04:15] no no, jessie + 10.1 [15:04:27] do as it is best for you [15:04:34] no opinion there [15:04:35] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Fix service for hosts with a default package [puppet] - 10https://gerrit.wikimedia.org/r/363840 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [15:04:41] cool, I will do it then if you are not going to use it for your tests [15:04:45] PROBLEM - puppet last run on db2012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:04:47] thank you! [15:05:46] (03PS3) 10Jcrespo: mariadb: Fix service for hosts with a default package [puppet] - 10https://gerrit.wikimedia.org/r/363840 (https://phabricator.wikimedia.org/T169514) [15:06:35] PROBLEM - puppet last run on pc1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:07:08] !log Stop MySQL on db1102 for MariaDB upgrade [15:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:29] (03PS4) 10Jcrespo: mariadb: Fix service for hosts with a default package [puppet] - 10https://gerrit.wikimedia.org/r/363840 (https://phabricator.wikimedia.org/T169514) [15:09:05] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:09:25] PROBLEM - puppet last run on labsdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:09:33] (03CR) 10Jcrespo: [C: 032] mariadb: Fix service for hosts with a default package [puppet] - 10https://gerrit.wikimedia.org/r/363840 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [15:09:35] PROBLEM - puppet last run on pc2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:09:45] PROBLEM - puppet last run on labsdb1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:10:05] PROBLEM - puppet last run on db1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:10:25] PROBLEM - puppet last run on dbstore1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:12:04] 10Operations, 10DBA, 10Scoring-platform-team, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3415739 (10Halfak) Announcements have been updated. Thanks for the note. Shall we always announce a 1 hour maintenance window for DB maintenance? [15:13:23] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3415741 (10hashar) a:05hashar>03None **Status update** There are a few patches for puppet.git that are... [15:15:16] 10Operations, 10DBA, 10Scoring-platform-team, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3415746 (10jcrespo) It varies from maintanance to maintenance, depending on the work to be done. Some take more some take less- the "normally" was meant as "N... 
[15:16:07] 10Operations, 10Patch-For-Review: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3415747 (10herron) Received an alert today via the email-to-SMS gateway. Is this the expected behavior, or should the alert have been sent directly via SMS? [15:16:26] 10Operations: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3415748 (10herron) [15:17:03] (03PS1) 10Rush: labtest: new public servers add base + firewall [puppet] - 10https://gerrit.wikimedia.org/r/363843 (https://phabricator.wikimedia.org/T168893) [15:22:14] (03CR) 10Rush: [C: 032] labtest: new public servers add base + firewall [puppet] - 10https://gerrit.wikimedia.org/r/363843 (https://phabricator.wikimedia.org/T168893) (owner: 10Rush) [15:22:21] RECOVERY - mysqld processes on db1102 is OK: PROCS OK: 7 processes with command name mysqld [15:24:04] 7? [15:24:23] <_joe_> melius abundare quam deficere ("better to have too much than too little") [15:24:29] lol [15:26:00] 10Operations, 10Epic, 10Goal, 10Services (doing), and 2 others: Services Q1 2017/18 goal: Begin migrating job queue processing to multi-DC enabled eventbus infrastructure. - https://phabricator.wikimedia.org/T169937#3413824 (10mobrovac) [15:26:35] PROBLEM - Check HHVM threads for leakage on mw2148 is CRITICAL: NRPE: Command check_check_leaked_hhvm_threads not defined [15:27:03] It reminds me of the famous quote: "Don't create too many checks on icinga, or you may regret it in the future" --abraham lincoln [15:27:15] jynus: abe said that? [15:27:20] jynus: what a smart man [15:30:35] "Nearly all men can stand adversity, but if you want to test a man's character, give him root access." [15:30:52] (03PS1) 10Jcrespo: mariadb: Define variables before they are used on service.pp [puppet] - 10https://gerrit.wikimedia.org/r/363845 (https://phabricator.wikimedia.org/T169514) [15:31:05] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Define variables before they are used on service.pp [puppet] - 10https://gerrit.wikimedia.org/r/363845 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [15:32:04] four score and seven alerts ago [15:32:22] (03PS3) 10Jcrespo: mariadb: Switch db1102 role from sanitarium3->dbstore_multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/363204 (https://phabricator.wikimedia.org/T169514) [15:32:24] (03PS2) 10Jcrespo: mariadb: Define variables before they are used on service.pp [puppet] - 10https://gerrit.wikimedia.org/r/363845 (https://phabricator.wikimedia.org/T169514) [15:34:30] (03CR) 10Jcrespo: [C: 032] mariadb: Define variables before they are used on service.pp [puppet] - 10https://gerrit.wikimedia.org/r/363845 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [15:34:38] (03PS3) 10Jcrespo: mariadb: Define variables before they are used on service.pp [puppet] - 10https://gerrit.wikimedia.org/r/363845 (https://phabricator.wikimedia.org/T169514) [15:35:42] 10Operations, 10Wikimedia-Stream: rcstream service - gevent dependency incompatibility - https://phabricator.wikimedia.org/T153773#3415801 (10Aklapper) 05Open>03declined Closing this task as "declined" as RCStream is deprecated and scheduled to be shut down today. See T156919 for more information. [15:35:44] 10Operations, 10Wikimedia-Stream: Upstream prematurely closed connection - https://phabricator.wikimedia.org/T153772#3415805 (10Aklapper) 05Open>03declined Closing this task as "declined" as RCStream is deprecated and scheduled to be shut down today. See T156919 for more information.
[15:35:46] 10Operations, 10Wikimedia-Stream: Error on RCStream server startup for the "flash policy server" - https://phabricator.wikimedia.org/T153770#3415809 (10Aklapper) 05Open>03declined Closing this task as "declined" as RCStream is deprecated and scheduled to be shut down today. See T156919 for more information. [15:39:47] 10Operations, 10DBA, 10Scoring-platform-team, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3415818 (10Halfak) [15:40:55] 10Operations, 10DBA, 10Scoring-platform-team, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3369055 (10Halfak) [15:41:04] 10Operations, 10DBA, 10Scoring-platform-team, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3369055 (10Halfak) Gotcha. Next time, we should add these details to the task description and I'll pick them up from there when making announcement. :) In... [15:51:40] RECOVERY - puppet last run on db1020 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [15:52:10] RECOVERY - puppet last run on db1011 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [15:52:29] (03CR) 10Muehlenhoff: [C: 031] Add diffscan module. [puppet] - 10https://gerrit.wikimedia.org/r/363296 (https://phabricator.wikimedia.org/T169624) (owner: 10Ayounsi) [15:56:00] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [15:56:40] RECOVERY - puppet last run on db1016 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [15:57:41] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [15:58:30] RECOVERY - puppet last run on pc2004 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [15:59:10] RECOVERY - puppet last run on pc1004 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:59:11] RECOVERY - puppet last run on db1009 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [16:00:00] RECOVERY - puppet last run on db2011 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [16:00:10] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [16:00:20] RECOVERY - puppet last run on db2010 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [16:01:00] RECOVERY - puppet last run on db2030 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [16:01:30] RECOVERY - puppet last run on pc1005 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [16:01:40] RECOVERY - puppet last run on pc2005 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [16:03:20] RECOVERY - puppet last run on db2012 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [16:05:00] RECOVERY - puppet last run on pc1006 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [16:07:30] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [16:07:40] RECOVERY - puppet last run on labsdb1001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [16:08:20] RECOVERY - puppet last run on pc2006 is OK: OK: 
Puppet is currently enabled, last run 16 seconds ago with 0 failures [16:08:30] RECOVERY - puppet last run on labsdb1005 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [16:08:31] RECOVERY - puppet last run on db1001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [16:08:40] RECOVERY - puppet last run on dbstore1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:10:30] RECOVERY - puppet last run on db1043 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [16:13:40] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [16:13:40] RECOVERY - puppet last run on dbstore2001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [16:13:40] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [16:13:50] RECOVERY - puppet last run on db1048 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [16:14:50] RECOVERY - puppet last run on db1047 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [16:15:30] RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [16:17:51] (03PS1) 10Milimetric: Use parallelism to sqoop large tables [puppet] - 10https://gerrit.wikimedia.org/r/363846 (https://phabricator.wikimedia.org/T169782) [16:32:39] (03Draft1) 10Paladox: DO NOT MERGE [labs/private] - 10https://gerrit.wikimedia.org/r/363847 [16:32:41] (03PS2) 10Paladox: DO NOT MERGE [labs/private] - 10https://gerrit.wikimedia.org/r/363847 [17:02:53] !bash < jynus> It reminds me of the famous quote: "Don't create too many checks on icinga, or you may regret it in the future" --abraham lincoln [17:02:53] bd808: Stored quip at https://tools.wmflabs.org/bash/quip/AV0eARQxU4b8yJAIAfBE [17:03:51] lol [17:11:51] spoiler- the quote is actually fake [17:11:55] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3416029 (10faidon) FTR, as I mentioned on IRC, these three changes are continuing down the path of accumula... [17:17:35] (03CR) 10C. Scott Ananian: [C: 031] "I don't have C+2 rights to puppet. (I used to, though?)" [puppet] - 10https://gerrit.wikimedia.org/r/363045 (owner: 10Mobrovac) [17:24:49] (03CR) 10Mforns: [C: 031] "LGTM!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363846 (https://phabricator.wikimedia.org/T169782) (owner: 10Milimetric) [17:25:08] (03CR) 10Jcrespo: "I think that should get it from being a dbstore- we have analytics nodes, dbstores, and dbstores that are also analytics nodes. 
I will che" [puppet] - 10https://gerrit.wikimedia.org/r/356648 (owner: 10Jcrespo) [17:25:13] (03PS1) 10Umherirrender: Add ar_content_format and ar_content_model to labs views [puppet] - 10https://gerrit.wikimedia.org/r/363851 (https://phabricator.wikimedia.org/T89741) [17:38:10] (03PS4) 10Paladox: Gerrit: Add gerrit pub key for ssh [labs/private] - 10https://gerrit.wikimedia.org/r/363755 [17:43:38] (03PS10) 10Paladox: WIP: Gerrit: Add support for scap [puppet] - 10https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414) [17:44:06] (03PS11) 10Paladox: Gerrit: Add support for scap [puppet] - 10https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414) [17:46:31] (03CR) 10Paladox: Gerrit: Add support for scap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414) (owner: 10Paladox) [17:53:00] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Cloud-VPS, and 2 others: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3416130 (10chasemp) a:05Cmjohnson>03chasemp I'll try to take care of this in the am mon or tue [18:13:20] (03PS1) 10Dzahn: planet: add 3 new feeds to English planet [puppet] - 10https://gerrit.wikimedia.org/r/363857 [18:14:19] needs a bot that creates phab tickets when a wiki page changes [18:14:32] but only that one section on the page :p [18:15:22] (03CR) 10Paladox: [C: 031] planet: add 3 new feeds to English planet [puppet] - 10https://gerrit.wikimedia.org/r/363857 (owner: 10Dzahn) [18:17:05] (03CR) 10Dzahn: [C: 032] planet: add 3 new feeds to English planet [puppet] - 10https://gerrit.wikimedia.org/r/363857 (owner: 10Dzahn) [18:17:51] (03PS2) 10Dzahn: planet: add 3 new feeds to English planet [puppet] - 10https://gerrit.wikimedia.org/r/363857 [18:18:07] (03PS1) 10BryanDavis: toolforge: Remove SCAN redis command [puppet] - 10https://gerrit.wikimedia.org/r/363858 (https://phabricator.wikimedia.org/T169957) [18:18:15] chasemp: ^ [18:18:17] (03PS3) 10Dzahn: planet: add 3 new feeds to English planet [puppet] - 10https://gerrit.wikimedia.org/r/363857 [18:19:08] (03CR) 10Rush: [V: 032 C: 032] toolforge: Remove SCAN redis command [puppet] - 10https://gerrit.wikimedia.org/r/363858 (https://phabricator.wikimedia.org/T169957) (owner: 10BryanDavis) [18:19:30] bd808: done [18:22:19] (03PS4) 10Dzahn: planet: add 3 new feeds to English planet [puppet] - 10https://gerrit.wikimedia.org/r/363857 [18:23:17] 10Operations, 10Mail, 10fundraising-tech-ops: (re)move problemsdonating aliases - https://phabricator.wikimedia.org/T127488#3416411 (10eross) @Dzahn From the ticket on Zendesk it stated problemsdonate@ but I can change if it needs to be problemsdonating@ however, problemsdonating@ is already created as an u... [19:09:39] !summon wikibugs [19:14:08] mutante: looks like it needs a restart [19:14:13] [20:29:05] * wikibugs (tools.wiki@wikimedia/bot/pywikibugs) quit (Ping timeout: 260 seconds) [19:14:42] You know what would be cool? A bot restarting bot [19:14:54] Who restarts the bot restarting bot? [19:15:10] that is what i meant to imply :) [19:15:16] by making up that command, heh [19:15:28] Hehehe [19:16:50] maybe i will do it some day when i get to https://gerrit.wikimedia.org/r/#/c/320698/ :) [19:16:53] mutante: Where did we leave off on finishing moving releases.wm.o off bromine? 
[19:17:08] since eggdrop = rock stable, hehehe [19:17:30] RainbowSprinkles: last thing i did was make the rsync work and confirmed the files were there [19:17:39] we didn't switch it yet because of the upload part [19:17:41] That's what I thought [19:17:48] Yeah, that's the remaining bit, upload + dns [19:17:59] if it wasn't for the upload, i would change DNS now [19:18:12] but that part was always so tricky and i had like at least 2 long debug sessions with subbu [19:18:16] on the _existing_ setup :p [19:18:21] * RainbowSprinkles nods [20:17:29] 10Operations, 10RESTBase, 10RESTBase-API, 10Traffic, 10Services (next): RESTBase support for www.wikimedia.org missing - https://phabricator.wikimedia.org/T133178#3416713 (10Krinkle) >>! In T102178#3402913, @Krinkle wrote: > @GWicke At which point was wikimedia.org (or www.wikimedia.org?) a wiki? Assumin... [20:17:43] 10Operations, 10Cloud-Services, 10RESTBase, 10Services, and 3 others: Fix RESTBase support for wikitech.wikimedia.org - https://phabricator.wikimedia.org/T102178#1358396 (10Krinkle) >>! In T102178#3403100, @GWicke wrote: > @krinkle, your comment sounds like it might have been intended for {T133178}. Indee... [20:32:05] greg-g: is it OK if I backport the patch for https://phabricator.wikimedia.org/T169261 ? [20:33:17] (03CR) 10ZZhou (WMF): "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/363884 (owner: 10Dzahn) [20:39:37] hi, is there a dns problem? [20:39:46] pinging has started failing for labs instances [20:40:32] things are recovering now. [20:46:17] (03PS7) 10Andrew Bogott: Refactor puppetmaster roles into role/profiles [puppet] - 10https://gerrit.wikimedia.org/r/363870 [20:47:34] 10Operations, 10Services (done): Set up grafana alerting for services - https://phabricator.wikimedia.org/T162765#3416834 (10GWicke) See also https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org#Alerts_.28with_notifications_via_Icinga.29 for some documentation on the topic by @halfak and myself. [20:48:10] (03CR) 10Paladox: "> still per my comment on PS2, the pub part goes in the public repo," [labs/private] - 10https://gerrit.wikimedia.org/r/363755 (owner: 10Paladox) [20:48:32] (03CR) 10Paladox: "Following how it was done for phabricator." [labs/private] - 10https://gerrit.wikimedia.org/r/363755 (owner: 10Paladox) [21:02:47] (03CR) 10Dzahn: ">What do you mean that the private key goes into the secret?" [labs/private] - 10https://gerrit.wikimedia.org/r/363755 (owner: 10Paladox) [21:02:58] (03PS8) 10Andrew Bogott: Refactor puppetmaster roles into role/profiles [puppet] - 10https://gerrit.wikimedia.org/r/363870 [21:03:43] (03CR) 10Dzahn: ">Following how it was done for phabricator." [labs/private] - 10https://gerrit.wikimedia.org/r/363755 (owner: 10Paladox) [21:03:47] (03CR) 10Paladox: "> >What do you mean that the private key goes into the secret?"
[labs/private] - 10https://gerrit.wikimedia.org/r/363755 (owner: 10Paladox) [21:04:47] (03CR) 10Dzahn: ">storing both keys under secret" [labs/private] - 10https://gerrit.wikimedia.org/r/363755 (owner: 10Paladox) [21:05:09] legoktm: sorry for the late response, but yes [21:11:54] 10Operations, 10MediaWiki-JobQueue, 10monitoring: Establish monitoring thresholds for job queue - https://phabricator.wikimedia.org/T79687#3416884 (10Krinkle) [21:11:57] 10Operations, 10MediaWiki-JobQueue, 10monitoring, 10Patch-For-Review: Redis monitoring needs to be improved - https://phabricator.wikimedia.org/T133179#3416882 (10Krinkle) [21:12:02] Hello, Wikimedia AI team is wondering how we would send celery logs and events to logstash on prod and labs [21:12:18] 10Operations, 10MediaWiki-JobQueue, 10monitoring: Redis monitoring needs to be improved - https://phabricator.wikimedia.org/T133179#2224324 (10Krinkle) [21:12:35] (03PS1) 10Rush: openstack: add wikitech-grep as utility for adminscripts [puppet] - 10https://gerrit.wikimedia.org/r/363896 (https://phabricator.wikimedia.org/T169820) [21:17:59] (03CR) 10jerkins-bot: [V: 04-1] openstack: add wikitech-grep as utility for adminscripts [puppet] - 10https://gerrit.wikimedia.org/r/363896 (https://phabricator.wikimedia.org/T169820) (owner: 10Rush) [21:21:39] (03CR) 10BryanDavis: openstack: add wikitech-grep as utility for adminscripts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363896 (https://phabricator.wikimedia.org/T169820) (owner: 10Rush) [21:22:18] (03CR) 10Dzahn: [C: 031] use 'require_package' for stats packages including python-yaml [puppet] - 10https://gerrit.wikimedia.org/r/363382 (owner: 10ArielGlenn) [21:26:54] (03CR) 10Chad: [C: 04-1] "Forking mwgrep is a terrible idea." [puppet] - 10https://gerrit.wikimedia.org/r/363896 (https://phabricator.wikimedia.org/T169820) (owner: 10Rush) [21:27:47] (03PS2) 10Rush: openstack: add wikitech-grep as utility for adminscripts [puppet] - 10https://gerrit.wikimedia.org/r/363896 (https://phabricator.wikimedia.org/T169820) [21:28:09] (03CR) 10Chad: [C: 04-1] openstack: add wikitech-grep as utility for adminscripts [puppet] - 10https://gerrit.wikimedia.org/r/363896 (https://phabricator.wikimedia.org/T169820) (owner: 10Rush) [21:28:13] (03Abandoned) 10Paladox: servermon: Make sure /etc/gunicorn.d/ exists [puppet] - 10https://gerrit.wikimedia.org/r/362601 (owner: 10Paladox) [21:28:14] RainbowSprinkles: can you help me understand? Forking mwgrep is a terrible idea. [21:28:18] honestly I just don't know [21:28:29] Why wouldn't you just add like extra options to the original mwgrep? [21:28:54] (03CR) 10jerkins-bot: [V: 04-1] openstack: add wikitech-grep as utility for adminscripts [puppet] - 10https://gerrit.wikimedia.org/r/363896 (https://phabricator.wikimedia.org/T169820) (owner: 10Rush) [21:29:08] jerkins-bot agrees, but probably for different reasons :p [21:29:10] I'm not sure why bryan did it this way, maybe to cut down on clutter for a specific use case [21:29:20] Silly. [21:29:37] bd808: do you care to pursue either wikitech-grep or modifying mwgrep?
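(The logstash question at 21:12:02 never got an answer in-channel. For the record, a minimal sketch of one common approach, assuming the third-party python-logstash package and a Logstash TCP input; the hostname, port, app name, and broker URL are placeholders, not the actual prod/labs endpoints.)

    import logging

    import logstash  # third-party package: python-logstash
    from celery import Celery
    from celery.signals import after_setup_logger, after_setup_task_logger

    app = Celery('scoring', broker='redis://localhost:6379/0')  # placeholder broker

    def attach_logstash_handler(logger, **kwargs):
        # Ship records to Logstash as v1 JSON events over TCP.
        handler = logstash.TCPLogstashHandler('logstash.example.org', 5959, version=1)
        handler.setLevel(logging.INFO)
        logger.addHandler(handler)

    # Celery fires these signals after it has configured its global and
    # per-task loggers, so handlers added here survive Celery's own setup.
    after_setup_logger.connect(attach_logstash_handler)
    after_setup_task_logger.connect(attach_logstash_handler)

(This covers log lines only; Celery's task events are a separate stream and would need their own consumer.)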
[21:29:39] Like, if you want to limit mwgrep to one (or more) wikis, just add like a --wiki parameter [21:29:58] I mean, it's a utility that no one is asking you to use and I don't feel that strongly about it honestly [21:30:02] but we are going to keep using it I think [21:30:08] If you want to use different namespaces, expand the ability from --user and such to be more namespace-agnostic [21:30:09] it just won't be in puppet [21:30:44] sigh [21:30:51] I mean if you want to use unpuppetized stuff that's on you, I just think forking within puppet is a bad idea. [21:30:53] my point is the outcome of an in-use thing being in puppet and findable so that could happen is better than the current situation [21:30:56] this is why it was just in ~bd808 [21:31:08] ok fair, I'll abandon and keep it on the sly [21:31:15] making mwgrep better would be great [21:31:21] it seems like a silly argument [21:31:47] getting rid of mwgrep and having a cirrussearch replica cluster would be better [21:32:26] Talk to discovery and cloud services about that. [21:32:33] arguing about a 100 line python wrapper around the elasticsearch api is not worth anyone's time [21:32:34] (reposting question as it probably got lost in scrollback) Does anyone know how Wiki-ai can log celery logs and events to logstash? [21:32:51] I haven't used mwgrep so I'm not sure about it [21:32:53] RainbowSprinkles: it was in my hardware ask. it just didn't make the final cut [21:33:01] bummer :( [21:33:11] we will get it done [21:33:16] RainbowSprinkles: is this a thing you feel like standing on? the -1 [21:33:36] I've lived a productive life, I guess I found my hill to die on? [21:33:41] jk. Tempest/teapot [21:33:41] you have thought about it more than I, I haven't used mwgrep in all honesty only this shell of a util [21:33:44] I just think it's dumb af [21:33:53] When you could add like a --wiki parameter to mwgrep [21:33:57] that's fair, then make the changes you are requesting? [21:34:24] I guess...it's cool you object honestly but why you care if we use this I don't understand [21:34:47] it has no impact on anything or anyone [21:35:10] Because the existing tool works too :) [21:35:15] 10Operations, 10Wikimedia-Site-requests: Update to interwiki map - https://phabricator.wikimedia.org/T169979#3416966 (10Zppix) IIRC ops have to run a script to update the interwiki map, therefore adding the tag [21:35:30] can I use it to search wikitech for instances of the string 'bigbrother'? [21:35:37] honest question as I don't know [21:35:40] and that's my most recent use case [21:35:49] no, it's just for looking for js stuff [21:35:56] It can look at other things [21:36:03] --module searches NS_MODULE [21:36:13] But the code is versatile enough that a small tweak could make it general-purpose namespace [21:36:25] sure but no one is going to do it or wants to [21:36:56] FWIW legoktm is chastising me in other places for making a cli tool at all [21:37:23] it seems odd to me people care, what difference does it make to anyone else? [21:38:47] jenkins does hate it as well [21:40:24] * RainbowSprinkles removes his -1, considers his objection heard [21:41:07] 10Operations, 10Wikimedia-Site-requests: Update to interwiki map - https://phabricator.wikimedia.org/T169979#3415305 (10Dzahn) Do you mean running a script on a maintenance server? There are > 55 deployers who can do that too (and often do), not just the few ops.
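(For readers outside the channel: the tool being argued over is on the order of the sketch below. This is not mwgrep's or wikitech-grep's actual source, just an illustration of the "--wiki parameter" tweak suggested at 21:29:39; the index naming, field names, and server URL are assumptions, and the query call is the pre-8.x elasticsearch-py style.)

    import argparse

    from elasticsearch import Elasticsearch  # third-party package: elasticsearch

    def main():
        parser = argparse.ArgumentParser(description='grep page text via the search indices')
        parser.add_argument('pattern', help='phrase to look for')
        parser.add_argument('--wiki', help="limit the search to one wiki's index, e.g. labswiki")
        parser.add_argument('--ns', type=int, default=8, help='namespace id (8 = MediaWiki)')
        args = parser.parse_args()

        es = Elasticsearch(['http://search.example.org:9200'])  # placeholder URL
        index = '{}_content'.format(args.wiki) if args.wiki else '_all'
        body = {
            'query': {'bool': {
                'filter': [{'term': {'namespace': args.ns}}],
                'must': [{'match_phrase': {'source_text': args.pattern}}],
            }},
            '_source': ['title'],
            'size': 100,
        }
        for hit in es.search(index=index, body=body)['hits']['hits']:
            print(hit['_source']['title'])

    if __name__ == '__main__':
        main()

(With flags like these, the 21:35:30 wikitech/'bigbrother' use case becomes a command-line option rather than a second, forked script.)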
[21:42:00] RainbowSprinkles: I think I don't know enough history about mwgrep to know if this is a violation of standing arguments [21:42:40] 10Operations, 10Mail, 10fundraising-tech-ops: (re)move problemsdonating aliases - https://phabricator.wikimedia.org/T127488#3416972 (10Dzahn) @eross Thank you Emerauld! Appreciate it. I will remove it on our side and close this ticket here. Yes, the rest can just be up to James A and Fundraising. [21:42:53] mutante: I typed that on mobile (my desktop won't connect due to internet on my end so that's why i summed it up) [21:43:49] chasemp: It's not about the history of mwgrep. [21:43:58] It's just that I think forking code is usually lame [21:44:36] considering I haven't used mwgrep I'm behind the curve [21:45:23] forking is almost never the answer imho. It's one of the things I dislike about GitHub [21:45:47] * RainbowSprinkles steps out for some air [21:46:04] 10Operations, 10Wikimedia-Site-requests: Update to interwiki map - https://phabricator.wikimedia.org/T169979#3416974 (10Zppix) >>! In T169979#3416969, @Dzahn wrote: > Do you mean running a script on a maintenance server? If that is possible to do sometime, yes if i need to be there let me know and i will sho... [21:47:57] (03PS1) 10Chad: Updating interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363901 (https://phabricator.wikimedia.org/T169979) [21:48:07] (03CR) 10Chad: [C: 032] Updating interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363901 (https://phabricator.wikimedia.org/T169979) (owner: 10Chad) [21:49:42] (03Merged) 10jenkins-bot: Updating interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363901 (https://phabricator.wikimedia.org/T169979) (owner: 10Chad) [21:49:52] (03CR) 10jenkins-bot: Updating interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363901 (https://phabricator.wikimedia.org/T169979) (owner: 10Chad) [21:50:14] (03CR) 10Hashar: [C: 031] "Krinkle: correct. Though some jobs invoke SiteConfiguration::getConfig() which ends up shelling out :\ So indirectly jobs do rely on mwsc" [puppet] - 10https://gerrit.wikimedia.org/r/358896 (https://phabricator.wikimedia.org/T146285) (owner: 10Chad) [21:52:39] !log demon@tin Synchronized wmf-config/interwiki.php: Updating interwiki cache, T169979 (duration: 00m 43s) [21:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:53] T169979: Update to interwiki map - https://phabricator.wikimedia.org/T169979 [21:53:03] 10Operations, 10Wikimedia-Site-requests, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Update to interwiki map - https://phabricator.wikimedia.org/T169979#3416997 (10Zppix) a:03demon [21:53:37] Zppix: Please don't assign tasks to me next [21:53:39] time [21:53:51] k [21:54:44] !log legoktm@tin Synchronized php-1.30.0-wmf.7/extensions/CentralAuth/: Fix handling of password hash upgrade on login - T169261 (duration: 00m 45s) [21:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:55] T169261: Users unable to remain logged in, associated with attempts to upgrade the password hash on every login - https://phabricator.wikimedia.org/T169261 [22:26:55] 10Operations, 10MediaWiki-JobRunner: Rationalize our jobqueues redis topology - https://phabricator.wikimedia.org/T135113#3417080 (10Krinkle) [22:36:34] 10Operations, 10Mail: Move most (all?)
exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#3417115 (10Dzahn) [22:36:36] 10Operations, 10Mail, 10fundraising-tech-ops: (re)move problemsdonating aliases - https://phabricator.wikimedia.org/T127488#3417112 (10Dzahn) 05Open>03Resolved a:03Dzahn Removed on ops side. I see that problems.donating , problemdonating and problem.donating and comentarios work in Google. The other... [22:40:21] (03PS1) 10Chad: WIP: Simple wrapper around updating the interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363970 [23:39:42] 10Operations, 10Beta-Cluster-Infrastructure, 10HHVM: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#3417355 (10greg) >>! In T144006#2689359, @hashar wrote: > What is left is deployment-tmh01 which needs some packaging work for Jessie as I understood it. That was Oct 2016 :... [23:40:10] 10Operations, 10Beta-Cluster-Infrastructure, 10HHVM: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#3417359 (10greg) [23:40:12] 10Operations, 10Multimedia, 10TimedMediaHandler, 10HHVM, 10Patch-For-Review: Migrate video scalers to jessie - https://phabricator.wikimedia.org/T145742#2639607 (10greg)
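(A loose end from 22:40:21: the "simple wrapper" change is only named above, so the sketch below is a guess at the two-step workflow it would automate, inferred from the 21:52:39 !log line: regenerate wmf-config/interwiki.php from a maintenance script, then sync the file. The script path, the --wiki value, and the stdout behavior are assumptions, not the contents of change 363970.)

    import subprocess

    # Both paths are assumptions, for illustration only.
    DUMP_SCRIPT = 'extensions/WikimediaMaintenance/dumpInterwiki.php'
    TARGET = '/srv/mediawiki-staging/wmf-config/interwiki.php'

    def update_interwiki_cache():
        # Step 1: regenerate the cache (assumed here to print the array to stdout).
        out = subprocess.check_output(['mwscript', DUMP_SCRIPT, '--wiki=aawiki'])
        with open(TARGET, 'wb') as fh:
            fh.write(out)
        # Step 2: push the file out, producing a !log entry like the one at 21:52:39.
        subprocess.check_call(['scap', 'sync-file', 'wmf-config/interwiki.php',
                               'Updating interwiki cache'])

    if __name__ == '__main__':
        update_interwiki_cache()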