[00:22:31] (03CR) 10BearND: [C: 031] Deploy ReadingLists to the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384908 (https://phabricator.wikimedia.org/T174651) (owner: 10Gergő Tisza)
[00:37:42] (03CR) 10Paladox: [C: 031] Also use scap-deployed version of gerrit.war for actual running of gerrit [puppet] - 10https://gerrit.wikimedia.org/r/384760 (owner: 10Chad)
[00:46:32] (03PS3) 10Paladox: Gerrit: Replace certificates with tokens for its-phabricator [labs/private] - 10https://gerrit.wikimedia.org/r/384902 (https://phabricator.wikimedia.org/T178385)
[01:13:47] (03CR) 10VolkerE: [C: 031] "needs rebase" [puppet] - 10https://gerrit.wikimedia.org/r/381275 (owner: 10Krinkle)
[01:22:24] PROBLEM - HHVM rendering on mw2135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:23:15] RECOVERY - HHVM rendering on mw2135 is OK: HTTP OK: HTTP/1.1 200 OK - 73911 bytes in 0.313 second response time
[01:25:14] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0
[01:33:55] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0
[01:34:15] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[01:58:42] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0
[01:59:23] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0
[02:09:41] 10Operations, 10Cloud-Services, 10DBA, 10Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#3692629 (10bd808)
[02:09:53] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[02:22:03] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0
[02:22:12] 10Operations, 10Cloud-Services, 10DBA, 10Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#3692639 (10bd808)
[02:25:12] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[02:33:34] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.3) (duration: 08m 21s)
[02:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:36:02] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0
[02:51:22] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0
[02:51:32] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0
[03:07:12] 10Operations, 10Collaboration-Team-Triage, 10Notifications, 10Anti-Harassment (AHT Sprint 7), 10Patch-For-Review: Recover Echo Notification Blacklist from Backup - https://phabricator.wikimedia.org/T178313#3692673 (10dbarratt) Thanks @Catrope for getting me access to the file. Here's how to get everyone...
[03:09:58] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.4) (duration: 14m 58s)
[03:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:16:11] 10Operations, 10Collaboration-Team-Triage, 10Notifications, 10Anti-Harassment (AHT Sprint 7), 10Patch-For-Review: Recover Echo Notification Blacklist from Backup - https://phabricator.wikimedia.org/T178313#3692690 (10dbarratt) To be fair... just skimming the values, there are a lot that are valid, but th...
[03:17:01] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Oct 18 03:17:01 UTC 2017 (duration 7m 3s)
[03:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:21:52] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0
[03:21:53] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[03:27:33] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 866.10 seconds
[03:33:02] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0
[03:33:02] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0
[03:33:22] 10Operations, 10Traffic: Renew unified certificates 2017 - https://phabricator.wikimedia.org/T178173#3692711 (10BBlack)
[03:33:24] 10Operations, 10Traffic, 10Wikimedia-Planet, 10procurement: *.planet.wikimedia.org SSL cert expires 2017-11-22 - https://phabricator.wikimedia.org/T178444#3692714 (10BBlack)
[03:33:27] 10Operations, 10Phabricator, 10Traffic, 10procurement, 10HTTPS: wmfusercontent.org SSL cert expires 2017-11-22 - https://phabricator.wikimedia.org/T178443#3692715 (10BBlack)
[03:46:12] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[03:46:12] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0
[03:59:32] PROBLEM - HHVM rendering on mw2132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:00:23] RECOVERY - HHVM rendering on mw2132 is OK: HTTP OK: HTTP/1.1 200 OK - 74341 bytes in 0.312 second response time
[04:25:24] PROBLEM - Apache HTTP on mw2218 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:26:13] RECOVERY - Apache HTTP on mw2218 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.118 second response time
[04:28:43] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 185.06 seconds
[05:20:58] 10Operations, 10ops-eqiad, 10DBA: db1101 crashed - memory errors - https://phabricator.wikimedia.org/T178383#3692787 (10Marostegui) The ALTER tables finished correctly and no more crashes happened. Let's change the memory anyways @Cmjohnson
[05:22:16] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384921
[05:22:20] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384921
[05:24:02] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384921 (owner: 10Marostegui)
[05:25:19] !log Optimize pagelinks and templatelinks on db1102 for s2 - T174509
[05:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:27] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[05:26:27] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384921 (owner: 10Marostegui)
[05:26:36] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384921 (owner: 10Marostegui)
[05:27:38] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1067 - T174509 (duration: 00m 50s)
[05:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:00] (03PS1) 10Marostegui: db-eqiad.php: Depool db1098 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384922 (https://phabricator.wikimedia.org/T174509)
[05:33:36] (03PS3) 10Marostegui: install_server: Reinstall db2084 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/384690 (https://phabricator.wikimedia.org/T178359)
[05:34:55] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1098 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384922 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui)
[05:36:08] (03CR) 10Marostegui: [C: 032] install_server: Reinstall db2084 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/384690 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui)
[05:36:10] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1098 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384922 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui)
[05:36:18] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1098 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384922 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui)
[05:36:42] !log Optimize pagelinks and templatelinks on db1098 - T174509
[05:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:36:48] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[05:37:29] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1098 - T174509 (duration: 00m 49s)
[05:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:39:16] (03PS1) 10Marostegui: db-eqiad.php: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384923 (https://phabricator.wikimedia.org/T174509)
[05:40:49] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384923 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui)
[05:40:59] !log Optimize pagelinks and templatelinks on db1055 - T174509
[05:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:42:32] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384923 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui)
[05:42:45] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384923 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui)
[05:43:51] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1055 - T174509 (duration: 00m 50s)
[05:43:56] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1072 to fix data drifts" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384924
[05:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:58] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[05:44:00] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1072 to fix data drifts" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384924
[05:45:33] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1072 to fix data drifts" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384924 (owner: 10Marostegui)
[05:46:37] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1072 to fix data drifts" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384924 (owner: 10Marostegui)
[05:48:05] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1072 - T164488 (duration: 00m 49s)
[05:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:13] T164488: Run pt-table-checksum on s3 - https://phabricator.wikimedia.org/T164488
[05:48:13] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1072 to fix data drifts" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384924 (owner: 10Marostegui)
[06:03:43] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.160 second response time
[06:04:42] PROBLEM - Blazegraph Port on wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:04:43] PROBLEM - Blazegraph process on wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:04:43] PROBLEM - Updater process on wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:04:53] PROBLEM - SSH on wdqs1003 is CRITICAL: Server answer
[06:05:02] PROBLEM - configured eth on wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:05:12] PROBLEM - Disk space on wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:05:13] PROBLEM - Check size of conntrack table on wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:05:23] PROBLEM - dhclient process on wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:05:23] PROBLEM - Check whether ferm is active by checking the default input chain on wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:05:33] PROBLEM - DPKG on wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:05:33] PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:05:43] PROBLEM - puppet last run on wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:06:03] PROBLEM - WDQS HTTP Port on wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:09:14] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "I would rather have the process of syncing facts be a complex process where you run a program that cleans up any potentially sensible data" [puppet] - 10https://gerrit.wikimedia.org/r/384834 (https://phabricator.wikimedia.org/T177843) (owner: 10Herron)
[06:25:22] PROBLEM - Check the NTP synchronisation status of timesyncd on wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:27:11] can't ssh into that :/
[06:27:42] PROBLEM - IPMI Sensor Status on wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:30:13] Anyone able to help? I can't ssh into it, so can't do anything to wdqs1003
[06:30:14] not even depool it
[06:30:52] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.590 second response time
[06:33:31] _joe_: ^
[06:33:47] <_joe_> hoo: I am trying
[06:34:22] <_joe_> hoo: I can depool it, but pybal should depool it automatically
[06:34:33] well, it didn't :/
[06:35:07] <_joe_> ok so that's a problem of how the check is done I think
[06:35:23] <_joe_> ema: ^^ can you take a look at the pybal logs while I kick the machine?
[06:37:19] _joe_: sure
[06:38:29] <_joe_> so I'm in the console, it seems an OOM
[06:38:48] not much of a surprise… gehel is doing GC testing right now
[06:39:16] there's no pybal log on lvs1003 re: wdqs1003
[06:39:28] $ curl http://localhost:9090/pools/wdqs_80
[06:39:28] wdqs1003.eqiad.wmnet: enabled/up/pooled
[06:39:28] wdqs1005.eqiad.wmnet: enabled/up/pooled
[06:39:28] wdqs1004.eqiad.wmnet: enabled/up/pooled
[06:39:53] and indeed HTTP requests to wdqs1003:80 work fine
[06:40:47] that's the reverse proxy in front of the actual service
[06:42:05] <_joe_> ema: what url does get checked there?
[06:42:50] _joe_: proxyfetch.url = ["http://localhost/"]
[06:43:15] <_joe_> ok, meh, that's the reason
[06:43:42] it should check both that and port 9999 (where the actual service is running)
[06:43:45] and I get stuff like [...] Wikidata Query Service
[06:44:16] <_joe_> hoo: no you can't check port 9999
[06:44:26] <_joe_> you should check some url that is meaningful
[06:44:35] <_joe_> and shows if the service is up and functioning
[06:44:58] It could probably run a SELECT "1"-like dummy query cheaply
[06:45:16] FTR there's this dashboard you can check to see what pybal thinks of its services: https://grafana.wikimedia.org/dashboard/db/pybal?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-server=lvs1003&var-service=All
[06:46:17] <_joe_> !log powercycling wdqs1003
[06:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:52] PROBLEM - Host wdqs1003 is DOWN: PING CRITICAL - Packet loss = 100%
[06:47:54] hoo: what's the GET request producing that query?
[06:48:32] RECOVERY - dhclient process on wdqs1003 is OK: PROCS OK: 0 processes with command name dhclient
[06:48:32] RECOVERY - Check size of conntrack table on wdqs1003 is OK: OK: nf_conntrack is 0 % full
[06:48:33] RECOVERY - Check whether ferm is active by checking the default input chain on wdqs1003 is OK: OK ferm input default policy is set
[06:48:42] RECOVERY - Host wdqs1003 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[06:48:42] RECOVERY - DPKG on wdqs1003 is OK: All packages OK
[06:48:43] RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational
[06:48:58] <_joe_> we could even add a /liveness url to the nginx frontend that does the query for us
[06:49:02] RECOVERY - SSH on wdqs1003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[06:49:03] RECOVERY - configured eth on wdqs1003 is OK: OK - interfaces up
[06:49:09] <_joe_> pretty much like probes work in kubernetes
[06:49:12] RECOVERY - WDQS HTTP Port on wdqs1003 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 80
[06:49:19] I'm still dropping off Oscar at daycare. _joe_ , hoo , thanks for taking care of wdqs1003!
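The failure mode discussed above can be sketched in a few lines: a pybal-style fetch check that only requests `http://localhost/` sees the nginx reverse proxy answer 200 even while Blazegraph behind it is dead. This is a hypothetical illustration, not pybal's real internals; the function names, URLs, and the injected `fetch` callable are all made up for the example.

```python
# Sketch of why a proxy-only health check keeps a dead backend pooled.
# A host counts as healthy only if every configured check URL returns 200.
# (Illustrative only; not pybal's actual ProxyFetch implementation.)

def host_is_healthy(fetch, urls):
    """Healthy only if every check URL returns HTTP 200."""
    return all(fetch(url) == 200 for url in urls)

# Simulated responses while Blazegraph is OOM but nginx still answers:
responses = {
    "http://localhost/": 200,                           # static proxy page: fine
    "http://localhost/sparql?query=ASK%20%7B%7D": 503,  # query reaching the backend: dead
}
fetch = responses.get

# Checking only the proxy front page keeps the host pooled:
assert host_is_healthy(fetch, ["http://localhost/"])
# Adding a URL that actually exercises the backend would depool it:
assert not host_is_healthy(fetch, list(responses))
```

This mirrors the conclusion in the log: the check URL must exercise the backing service, not just the proxy in front of it.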
[06:49:22] RECOVERY - Disk space on wdqs1003 is OK: DISK OK
[06:50:25] <_joe_> gehel: you'll be needed for the followup, don't worry ;)
[06:50:35] BTW, GC tuning was done on wdqs1004, so the cause is something else...
[06:50:43] RECOVERY - puppet last run on wdqs1003 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:51:15] <_joe_> gehel: ping me when you're back, there is no hurry now
[06:53:24] _joe_: will do
[06:55:22] RECOVERY - Check the NTP synchronisation status of timesyncd on wdqs1003 is OK: OK: synced at Wed 2017-10-18 06:55:14 UTC.
[06:57:42] RECOVERY - IPMI Sensor Status on wdqs1003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[07:02:29] !log Stop MySQL on db2079 to copy its data to db2084 - T178359
[07:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:39] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[07:07:56] PROBLEM - High lag on wdqs1003 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0]
[07:08:11] hmm this is weird. wdqs1003 should have tons of memory and java heap is way below it
[07:08:56] unfortunately I can't access logs :(
[07:10:26] PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - /var/lib/docker/overlay2/792542d039b7e66780e8b3e9a00fd5e2ab986252b8ac1f0d74719d45be63cb66/merged is not accessible: Permission denied
[07:10:52] but it looks like it was in trouble since 21:00 judging by graphs, and somehow icinga alerted only at 23:00
[07:11:56] I think we should depool wdqs1003 for now, it's 9 hrs behind, not really good to serve queries from it
[07:12:45] _joe_: could you depool it and re-pool when it catches up? you could see the lag at https://grafana.wikimedia.org/dashboard/db/wikidata-query-service
[07:13:20] * gehel is back
[07:13:31] SMalyshev, _joe_: I'll depool wdqs1003
[07:13:42] ok, cool
[07:13:56] and probably worth looking in the logs to see what happened there....
[07:14:04] !log depooling wdqs1003 until it catches up on updates
[07:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:39] <_joe_> gehel: quite a few things to fix here
[07:18:55] _joe_: yep. First LVS check
[07:19:20] <_joe_> gehel: yeah so what we need is to have a url we can call from the load-balancer
[07:19:32] RECOVERY - Disk space on contint1001 is OK: DISK OK
[07:19:33] <_joe_> that tells us if wdqs is up and has an acceptable lag
[07:19:38] _joe_: that one should be easy
[07:19:57] <_joe_> so I'd love to have a "liveness" check and a "readiness" check
[07:20:14] <_joe_> pretty much like kubernetes does
[07:21:29] _joe_: not sure I understand...
[07:22:13] gehel: I see a lot of throttling messages in the log... looks like somebody was really spamming the server... and throttling didn't help
[07:22:31] PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - /var/lib/docker/overlay2/477586f8c192c241d251dcda6d0d050581e5e075b25bf945435a8cb391d8e64d/merged is not accessible: Permission denied
[07:28:46] (03PS3) 10Hoo man: Enable description usage tracking on a few test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384003 (https://phabricator.wikimedia.org/T177155)
[07:28:48] (03PS2) 10Hoo man: Re-enable Statement usage tracking on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384592 (https://phabricator.wikimedia.org/T151717)
[07:31:37] <_joe_> gehel: something like
[07:32:03] <_joe_> one url we can call that tells us "wdqs can perform basic queries and has an acceptable lag"
[07:32:14] <_joe_> by responding 200
[07:32:31] <_joe_> and some other http code if that's not the case
[07:32:39] <_joe_> that url could be checked by pybal
[07:35:08] 10Operations, 10Discovery, 10Recommendation-API, 10Wikidata, and 3 others: flapping monitoring for recommendation_api on scb - https://phabricator.wikimedia.org/T178445#3692887 (10mobrovac) The 429 coming from WDQS. @Gehel, @Smalyshev would it be possible to split WDQS' rate limiting for internal and exter...
[07:37:11] is there an ongoing known issue with WDQS?
[07:38:55] <_joe_> mobrovac: read backlog
[07:39:02] <_joe_> mobrovac: why are you asking?
[07:39:28] 10Operations, 10Discovery, 10Recommendation-API, 10Wikidata, and 3 others: flapping monitoring for recommendation_api on scb - https://phabricator.wikimedia.org/T178445#3692895 (10Smalyshev) @mobrovac what's the rate the requests are currently sent at? IIRC the limits we have are pretty generous, but depen...
[07:39:31] <_joe_> oh our beloved SOA with services failing in sequence
[07:39:32] because the recommendation api service was flapping _joe_
[07:39:45] <_joe_> I love that so much
[07:40:03] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1034" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384937
[07:40:16] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1034" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384937
[07:40:26] 10Operations, 10monitoring, 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban): Icinga disk space alert when a Docker container is running on an host - https://phabricator.wikimedia.org/T178454#3692896 (10hashar) For a running Docker container we have: ``` overlay on /v...
[07:40:28] <_joe_> mobrovac: because instead of creating microservices, we create filters, but that's a longer discussion ofc
[07:40:32] RECOVERY - Disk space on contint1001 is OK: DISK OK
[07:40:35] 10Operations, 10Discovery, 10Recommendation-API, 10Wikidata, and 3 others: flapping monitoring for recommendation_api on scb - https://phabricator.wikimedia.org/T178445#3692899 (10Gehel) @mobrovac it is possible if we can identify internal traffic. The throttling we apply is bucketed by user agent / IP, so...
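The readiness check _joe_ describes above (respond 200 when the service can answer a basic query and its update lag is acceptable, another status code otherwise) could be sketched as follows. This is a hypothetical illustration; the helper name, the lag threshold, and the decision logic are assumptions for the example, not the eventual WDQS implementation.

```python
# Hypothetical readiness decision for a /readiness-style URL: 200 when
# the service answers a trivial query AND its update lag is acceptable,
# 503 otherwise, so a load balancer like pybal can depool on either failure.
# The 1800 s threshold is borrowed from the icinga "High lag" critical
# threshold seen earlier in this log; it is illustrative here.

MAX_LAG_SECONDS = 1800

def readiness_status(query_ok, lag_seconds):
    """Return the HTTP status a readiness URL would respond with."""
    if query_ok and lag_seconds < MAX_LAG_SECONDS:
        return 200
    return 503

assert readiness_status(True, 60) == 200        # healthy, fresh data
assert readiness_status(True, 9 * 3600) == 503  # up but 9 h behind: depool
assert readiness_status(False, 60) == 503       # query failed: depool
```

The point of returning a status code rather than a body is that a generic HTTP health checker needs no service-specific parsing, which matches the "pretty much like probes work in kubernetes" comparison in the log.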
[07:40:57] huh reading the backlog is a bit hard with all these notifications
[07:41:11] the contint1001 disk space alarm is triggered whenever a container runs on the server :/ So most probably it can be ignored for now and I have filed T178454
[07:41:13] T178454: Icinga disk space alert when a Docker container is running on an host - https://phabricator.wikimedia.org/T178454
[07:41:34] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1034" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384937 (owner: 10Marostegui)
[07:41:54] <_joe_> hashar: which storage driver are you using?
[07:42:27] <_joe_> I would strongly advise against using the loopback devicemapper
[07:42:35] <_joe_> that's what is causing trouble
[07:42:52] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1034" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384937 (owner: 10Marostegui)
[07:43:07] (03PS1) 10Gehel: wdqs: LVS check should reach blazegraph and do a simple query [puppet] - 10https://gerrit.wikimedia.org/r/384938
[07:43:10] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1034" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384937 (owner: 10Marostegui)
[07:43:16] _joe_: hello; I have no idea about the storage driver. I guess we just installed the docker-ce package for now and haven't tweaked any settings
[07:44:08] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1034 - T174509 (duration: 00m 59s)
[07:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:16] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[07:44:53] (03PS1) 10Marostegui: db-eqiad.php: Depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384939 (https://phabricator.wikimedia.org/T174509)
[07:46:44] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384939 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui)
[07:46:53] !log Optimize pagelinks and templatelinks on db1056 - T174509
[07:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:48] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384939 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui)
[07:49:49] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1056 - T174509 (duration: 00m 49s)
[07:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:57] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[07:53:34] 10Operations, 10Discovery, 10Recommendation-API, 10Wikidata, and 3 others: flapping monitoring for recommendation_api on scb - https://phabricator.wikimedia.org/T178445#3692915 (10mobrovac) >>! In T178445#3692895, @Smalyshev wrote: > @mobrovac what's the rate the requests are currently sent at? IIRC the li...
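On the storage-driver question above: _joe_ advises against the loopback devicemapper driver, and the `/var/lib/docker/overlay2/...` paths in the contint1001 alerts suggest overlay2 was in fact in use there. For reference, the driver can be pinned explicitly in `/etc/docker/daemon.json` rather than relying on the package default; this is a minimal sketch, and whether contint1001 actually carries such a file is an assumption.

```json
{
    "storage-driver": "overlay2"
}
```

The Docker daemon must be restarted after changing this file, and switching storage drivers makes existing images and containers invisible until they are rebuilt or migrated.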
[07:55:27] (03CR) 10Muehlenhoff: "Also reported in Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=878920" [puppet] - 10https://gerrit.wikimedia.org/r/384713 (https://phabricator.wikimedia.org/T174431) (owner: 10Muehlenhoff)
[08:00:05] hoo: #bothumor My software never has bugs. It just develops random features. Rise for Usage tracking. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171018T0800).
[08:00:05] No GERRIT patches in the queue for this window AFAICS.
[08:01:15] (03PS1) 10Muehlenhoff: Remove access for gwicke [puppet] - 10https://gerrit.wikimedia.org/r/384944
[08:01:20] (03PS1) 10Ema: cumin: add aliases for cache hosts [puppet] - 10https://gerrit.wikimedia.org/r/384945
[08:01:38] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384939 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui)
[08:02:38] mobrovac: does recommendation service use a specific user agent? I can't locate those 429s in the logs...
[08:04:42] SMalyshev: it sets the user agent to "Recommendation API (Wikimedia tool; learn more at https://meta.wikimedia.org/wiki/Recommendation_API)"
[08:05:40] (03CR) 10Hoo man: [C: 032] Enable description usage tracking on a few test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384003 (https://phabricator.wikimedia.org/T177155) (owner: 10Hoo man)
[08:07:15] (03Merged) 10jenkins-bot: Enable description usage tracking on a few test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384003 (https://phabricator.wikimedia.org/T177155) (owner: 10Hoo man)
[08:07:15] mobrovac: I just found it in https://logstash.wikimedia.org/goto/37bc20a91c05ad1c781500fcc806236f
[08:07:25] (03CR) 10jenkins-bot: Enable description usage tracking on a few test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384003 (https://phabricator.wikimedia.org/T177155) (owner: 10Hoo man)
[08:09:27] gehel: 90 reqs per 30 minutes should be about the right rate
[08:09:34] !log hoo@tin Synchronized wmf-config/InitialiseSettings.php: Enable description usage tracking on a few test wikis (T177155) (duration: 00m 50s)
[08:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:43] T177155: Find test wiki(s) for new description usage and enable there - https://phabricator.wikimedia.org/T177155
[08:10:43] mobrovac: yes, I can see them now... but I also see a bunch of requests from the same IP without a user agent
[08:11:40] SMalyshev: these are the scb nodes, which have ~10 services, so that might be any other of them (even though I'm not sure there should be any reqs without the UA)
[08:12:56] ah no I was looking at it wrong
[08:13:30] now looking per user agent I see about 2 req/s with this UA
[08:13:43] 2 reqs/s ?
[08:13:55] Not a single new description usage in 4 minutes… that's even more low profile than anticipated
[08:14:17] (03CR) 10Muehlenhoff: [C: 032] Remove access for gwicke [puppet] - 10https://gerrit.wikimedia.org/r/384944 (owner: 10Muehlenhoff)
[08:14:32] in some seconds it's 5 reqs... my kibana-fu is not strong enough to make a proper calculation though
[08:14:52] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0
[08:14:59] that is highly unusual
[08:15:27] (03PS1) 10Marostegui: core_multiinstance.my.cnf: Fixed typo [puppet] - 10https://gerrit.wikimedia.org/r/384946
[08:16:22] they seem to be all 200s though (though I suspect we somehow fail to log 429s...) anyway, it's late here, I'll check more into it tomorrow
[08:16:37] thnx SMalyshev
[08:17:15] (03PS2) 10Marostegui: core_multiinstance.my.cnf: Fixed typo [puppet] - 10https://gerrit.wikimedia.org/r/384946
[08:17:16] 10Operations, 10Discovery, 10Recommendation-API, 10Wikidata, and 3 others: flapping monitoring for recommendation_api on scb - https://phabricator.wikimedia.org/T178445#3692940 (10Gehel) Looking at [[ https://logstash.wikimedia.org/goto/37bc20a91c05ad1c781500fcc806236f | logs in logstash ]], it seems we th...
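Gehel's comment in T178445 describes the WDQS throttling as bucketed by user agent / IP. A minimal sliding-window sketch of that bucketing scheme follows; the class name, limit, and window size are illustrative assumptions, not the real WDQS throttle values or implementation.

```python
# Minimal sketch of rate limiting bucketed by (IP, user agent), in the
# spirit of the throttling discussed in T178445. Two clients sharing an
# IP but sending different user agents get independent budgets.
from collections import defaultdict, deque

class ThrottleBucket:
    def __init__(self, limit, window_seconds):
        self.limit = limit          # max requests per bucket per window
        self.window = window_seconds
        self.hits = defaultdict(deque)  # (ip, ua) -> recent request timestamps

    def allow(self, ip, user_agent, now):
        """Return True (serve) or False (the client would get HTTP 429)."""
        q = self.hits[(ip, user_agent)]
        while q and now - q[0] >= self.window:  # expire hits outside the window
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True

t = ThrottleBucket(limit=3, window_seconds=60)
ua = "Recommendation API (Wikimedia tool; ...)"
assert all(t.allow("10.0.0.1", ua, now) for now in (0, 1, 2))
assert not t.allow("10.0.0.1", ua, 3)   # 4th hit inside the window: throttled
assert t.allow("10.0.0.2", ua, 3)       # different IP: separate bucket
assert t.allow("10.0.0.1", ua, 61)      # old hits expired: allowed again
```

This also illustrates the operational problem raised later in the log: internal callers behind a shared node IP that omit a user agent all land in one bucket, so splitting internal from external traffic requires a distinguishing key.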
[08:17:20] (03CR) 10Marostegui: [C: 032] core_multiinstance.my.cnf: Fixed typo [puppet] - 10https://gerrit.wikimedia.org/r/384946 (owner: 10Marostegui)
[08:17:54] (03CR) 10Volans: "see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/384945 (owner: 10Ema)
[08:22:21] !log mobrovac@tin Started deploy [changeprop/deploy@065a06e]: Bug fix: Apply ignore_errors on a per-request basis
[08:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:39] !log mobrovac@tin Finished deploy [changeprop/deploy@065a06e]: Bug fix: Apply ignore_errors on a per-request basis (duration: 01m 17s)
[08:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:17] !log Stop MySQL on db2019 to copy its data to db2084 - T178359
[08:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:24] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[08:25:33] (03PS2) 10Hashar: prometheus: force ferm dns resolution to Ipv4 [puppet] - 10https://gerrit.wikimedia.org/r/381073 (https://phabricator.wikimedia.org/T176314)
[08:26:10] (03CR) 10Hashar: "Ditto for:" [puppet] - 10https://gerrit.wikimedia.org/r/381073 (https://phabricator.wikimedia.org/T176314) (owner: 10Hashar)
[08:29:21] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0
[08:31:38] (03PS1) 10Marostegui: s5.hosts: Change db2084 port to 3315 [software] - 10https://gerrit.wikimedia.org/r/384948 (https://phabricator.wikimedia.org/T178359)
[08:33:26] (03CR) 10Marostegui: [C: 032] s5.hosts: Change db2084 port to 3315 [software] - 10https://gerrit.wikimedia.org/r/384948 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui)
[08:33:51] (03PS3) 10Hoo man: Re-enable Statement usage tracking on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384592 (https://phabricator.wikimedia.org/T151717)
[08:34:06] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I would prefer us to add a config to nginx to proxy a request to a simple url to whatever logic we want, abstracted as much as possible fr" [puppet] - 10https://gerrit.wikimedia.org/r/384938 (owner: 10Gehel)
[08:34:27] (03Merged) 10jenkins-bot: s5.hosts: Change db2084 port to 3315 [software] - 10https://gerrit.wikimedia.org/r/384948 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui)
[08:34:48] (03CR) 10Gehel: "Ok, will do..." [puppet] - 10https://gerrit.wikimedia.org/r/384938 (owner: 10Gehel)
[08:37:07] (03PS2) 10Ema: cumin: add aliases for cache hosts [puppet] - 10https://gerrit.wikimedia.org/r/384945
[08:38:50] (03CR) 10Hoo man: Re-enable Statement usage tracking on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384592 (https://phabricator.wikimedia.org/T151717) (owner: 10Hoo man)
[08:38:53] (03CR) 10Hoo man: [C: 032] Re-enable Statement usage tracking on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384592 (https://phabricator.wikimedia.org/T151717) (owner: 10Hoo man)
[08:39:17] mobrovac: looking at https://logstash.wikimedia.org/goto/9209cb15771bce41c14dd744e68e606f I see much more than 1.5 req / minute... More like 30 req / minute...
[08:39:54] 10Operations, 10Beta-Cluster-Infrastructure, 10Thumbor, 10Release-Engineering-Team (Kanban): nutcracker fails to start due to lack of /var/run/nutcracker (ex: deployment-videoscaler01 has memcached failures) - https://phabricator.wikimedia.org/T178457#3692957 (10hashar) Fixed it by MANUALLY creating a `/va...
[08:40:05] (03Merged) 10jenkins-bot: Re-enable Statement usage tracking on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384592 (https://phabricator.wikimedia.org/T151717) (owner: 10Hoo man) [08:40:09] (03CR) 10jenkins-bot: Re-enable Statement usage tracking on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384592 (https://phabricator.wikimedia.org/T151717) (owner: 10Hoo man) [08:40:51] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): User[gwicke] [08:40:53] (03PS3) 10Ema: cumin: add aliases for cache hosts [puppet] - 10https://gerrit.wikimedia.org/r/384945 [08:41:48] !log hoo@tin Synchronized wmf-config/InitialiseSettings.php: Re-enable Statement usage tracking on cawiki (T151717) (duration: 00m 50s) [08:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:00] T151717: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717 [08:42:21] (03CR) 10Jcrespo: [C: 031] mysql/icinga/labtest: no pages if on labtest, pt.2 [puppet] - 10https://gerrit.wikimedia.org/r/384895 (https://phabricator.wikimedia.org/T178008) (owner: 10Dzahn) [08:44:23] gehel: hm, this is weird, i can't see external requests to it on the public api metrics - https://grafana-admin.wikimedia.org/dashboard/db/restbase?panelId=15&fullscreen&orgId=1 [08:44:54] (03CR) 10Ema: "https://puppet-compiler.wmflabs.org/compiler02/8362/neodymium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/384945 (owner: 10Ema) [08:45:19] (03PS1) 10Marostegui: mysql-core_codfw.yaml: Add db2084 to s4 and s5 [puppet] - 10https://gerrit.wikimedia.org/r/384950 (https://phabricator.wikimedia.org/T178359) [08:45:51] gehel: also, there seem to be a lot of requests to wdqs in codfw, which i'm not sure it's even possible [08:45:55] what is going on here? 
[08:46:58] (03CR) 10Marostegui: [C: 032] mysql-core_codfw.yaml: Add db2084 to s4 and s5 [puppet] - 10https://gerrit.wikimedia.org/r/384950 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [08:50:12] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [08:50:32] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [08:51:10] (03PS1) 10Hoo man: Allow specifying --group to sql [puppet] - 10https://gerrit.wikimedia.org/r/384951 [08:51:11] gehel: the majority of the requests seem to be generated in codfw, which is really strange [08:51:15] mobrovac: wdqs is active / active, so traffic to codfw is definitely possible [08:52:05] mobrovac: also, all of those requests are for /sparql, with no actual query, so it looks like a sanity check [08:52:37] gehel: yeah, but we still respect the DC boundaries, so if a request originates in eqiad, it will go to wdqs in eqiad, unless its discovery dns is set up differently [08:53:41] in the end, wdqs was under too much load (our throttling does not seem to be aggressive enough) and was misbehaving, that's what needs to be solved first! [08:57:05] 10Operations, 10Operations-Software-Development: Upgrade Cumin masters to stretch - https://phabricator.wikimedia.org/T177385#3692999 (10MoritzMuehlenhoff) Procurement ticket is T178392 [08:58:38] 10Operations, 10Collaboration-Team-Triage, 10Notifications, 10Anti-Harassment (AHT Sprint 7), 10Patch-For-Review: Recover Echo Notification Blacklist from Backup - https://phabricator.wikimedia.org/T178313#3693001 (10jcrespo) > why would there be ~1,500 with 0's exist on English Wikipedia As I commented... [08:59:16] 10Operations, 10Ops-Access-Requests, 10Research, 10Patch-For-Review: Request public key change for a research fellow - https://phabricator.wikimedia.org/T177889#3693004 (10Cervisiarius) Thanks.
I'm trying to log in as I used to from my old machine: $ ssh west1@stat1005.eqiad.wmnet but I get the (expected)... [09:02:27] (03PS1) 10Muehlenhoff: Fix Cumin alias for druid [puppet] - 10https://gerrit.wikimedia.org/r/384952 [09:02:47] (03CR) 10Volans: "nitpick inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/384945 (owner: 10Ema) [09:02:54] gehel: as for the queries to /sparql, the service sends them in the post body [09:03:27] (03CR) 10Giuseppe Lavagetto: Port docker builder (0323 comments) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/384081 (https://phabricator.wikimedia.org/T177276) (owner: 10Giuseppe Lavagetto) [09:03:28] mobrovac: Oh, of course! [09:03:36] (03CR) 10Muehlenhoff: [C: 032] Fix Cumin alias for druid [puppet] - 10https://gerrit.wikimedia.org/r/384952 (owner: 10Muehlenhoff) [09:03:42] (03PS2) 10Muehlenhoff: Fix Cumin alias for druid [puppet] - 10https://gerrit.wikimedia.org/r/384952 [09:20:30] (03CR) 10Jcrespo: [C: 031] "+1 to the idea, specially to send a message later to all deployers. The code itself is mostly mediawiki/releng, so cannot comment much on " [puppet] - 10https://gerrit.wikimedia.org/r/384951 (owner: 10Hoo man) [09:21:55] (03PS2) 10Gehel: wdqs: LVS check should reach blazegraph and do a simple query [puppet] - 10https://gerrit.wikimedia.org/r/384938 [09:22:56] (03CR) 10Gehel: "Actually, we might need to split this in 2 CR for deployment..." [puppet] - 10https://gerrit.wikimedia.org/r/384938 (owner: 10Gehel) [09:23:24] gehel: ok, i think i know where all of these requests are coming from. 
so, in one round of checks, the service sends 3 requests to wdqs (each 60 secs), multiplied by 10 hosts doing this, that's 0.5 reqs/s, however, when the automatic monitoring script doesn't receive the status it was expecting, it retries up to 5 times, which brings us to ~2reqs/s overall [09:30:18] !log drop MobileWikiAppToCInteraction_10375484_15423246 from the log database on dbstore1002,db1047,db1046 - T177960 [09:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:25] T177960: Archive tables to hadoop: MobileWikiAppToCInteraction_10375484_15423246 and Edit_13457736_15423246 - https://phabricator.wikimedia.org/T177960 [09:30:42] mobrovac: Oh... lovely retries! Without backoff I expect? [09:31:50] i think so, yes [09:32:07] or with a minimal back-off rate [09:34:01] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.231 second response time [09:36:25] !log reloading dbproxy1010's haproxy configuration [09:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:39] !log installing xserver/xvfb security updates [09:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:10] mobrovac: so in a bad situation, those retries probably make it worse... [09:40:28] In the end we are back to the initial question: how do we make wdqs more robust... 
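As an aside, the request-rate arithmetic in the discussion above (3 checks every 60 seconds per host, 10 hosts, up to 5 retries on failure) can be sanity-checked with a few lines. The numbers come from the log; treating the retry limit as a flat multiplier is a simplifying assumption, which is why the worst case lands a bit above the "~2 reqs/s" observed:

```python
# Back-of-the-envelope check of the wdqs request rate discussed above.
# Figures are taken from the log; the flat retry multiplier is an assumption.
CHECK_INTERVAL_S = 60   # each host runs a round of checks every 60 seconds
REQS_PER_ROUND = 3      # 3 requests to wdqs per round
HOSTS = 10              # 10 hosts performing the checks
MAX_RETRIES = 5         # retried up to 5 times when the status is unexpected

# Steady state, every check succeeding on the first attempt:
base_rate = REQS_PER_ROUND * HOSTS / CHECK_INTERVAL_S  # 0.5 reqs/s

# Upper bound if every request exhausts its retries (1 attempt + 5 retries):
worst_case = base_rate * (1 + MAX_RETRIES)  # 3.0 reqs/s; ~2 reqs/s observed

print(base_rate, worst_case)
```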
[09:40:29] (03PS3) 10Muehlenhoff: Provide deb-src entries for older distros on package builders [puppet] - 10https://gerrit.wikimedia.org/r/382160 [09:41:16] (03PS4) 10Ema: cumin: add aliases for cache hosts [puppet] - 10https://gerrit.wikimedia.org/r/384945 [09:41:19] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1098" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384955 [09:41:22] (03PS4) 10Elukey: Small refactor for some kafka classes to ease creation of mirror maker profile [puppet] - 10https://gerrit.wikimedia.org/r/384602 (https://phabricator.wikimedia.org/T177216) (owner: 10Ottomata) [09:41:24] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1098" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384955 [09:43:10] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1098" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384955 (owner: 10Marostegui) [09:43:38] (03PS5) 10Elukey: confluent::kafka: refactor existing code with a commons class [puppet] - 10https://gerrit.wikimedia.org/r/384602 (https://phabricator.wikimedia.org/T177216) (owner: 10Ottomata) [09:43:54] db1082? [09:44:25] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1098" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384955 (owner: 10Marostegui) [09:44:47] did it crash? [09:45:02] (03CR) 10Volans: [C: 031] "LGTM! thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/384945 (owner: 10Ema) [09:45:33] I think it did [09:45:38] looks so [09:46:10] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1098" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384955 (owner: 10Marostegui) [09:46:10] storage [09:46:12] looks broken [09:46:15] (03CR) 10Ema: [C: 032] cumin: add aliases for cache hosts [puppet] - 10https://gerrit.wikimedia.org/r/384945 (owner: 10Ema) [09:46:19] are you depooling it? 
[09:46:56] yes ^ [09:46:57] (03PS1) 10Jcrespo: mariadb: Emergency depool of db1082 (crashed) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384956 [09:47:08] (03CR) 10Marostegui: [C: 031] mariadb: Emergency depool of db1082 (crashed) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384956 (owner: 10Jcrespo) [09:47:10] +1? [09:47:29] (03CR) 10Jcrespo: [C: 032] mariadb: Emergency depool of db1082 (crashed) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384956 (owner: 10Jcrespo) [09:47:53] 10Operations, 10DBA: db1082 crashed - https://phabricator.wikimedia.org/T178460#3693095 (10Marostegui) [09:48:34] (03CR) 10jenkins-bot: mariadb: Emergency depool of db1082 (crashed) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384956 (owner: 10Jcrespo) [09:49:01] 10Operations, 10DBA: db1082 crashed - https://phabricator.wikimedia.org/T178460#3693109 (10Marostegui) Not the first time this happens to this same host: T158188 T145533 [09:49:31] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1962 bytes in 0.129 second response time [09:49:58] marostegui: repool db1098 [09:50:03] ok [09:50:06] ok to deploy or not? [09:50:10] 10Operations, 10Datasets-General-or-Unknown, 10Patch-For-Review: NFS on dataset1001 overloaded, high load on the hosts that mount it - https://phabricator.wikimedia.org/T169680#3693113 (10ArielGlenn) In order to get dewiki to complete on time (before the 20th), I'm running the 4th part of the rev history con... [09:50:15] jynus: ok to depool [09:50:22] no, the merge is to repool [09:50:29] repool yes [09:50:31] ok to repool [09:51:41] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1098, depool db1082 (crashed) (duration: 00m 50s) [09:51:47] did db1082 come back?
[09:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:27] 10Operations, 10DBA: db1082 crashed - https://phabricator.wikimedia.org/T178460#3693117 (10Marostegui) The RAID looks good: ``` root@db1082:~# hpssacli controller all show config Smart Array P840 in Slot 1 (sn: PDNNF0ARH1910I) Port Name: 1I Port Name: 2I Internal Drive Cage at Por... [09:53:47] 171018 9:42:42 [ERROR] InnoDB: Tried to read 16384 bytes at offset 595802554368. Was only able to read 0. [09:53:48] 2017-10-18 09:42:42 7f3908df7700 InnoDB: Operating system error number 5 in a file operation. [09:53:50] InnoDB: Error number 5 means 'Input/output error' [09:54:00] clearly block device error [09:54:04] (03CR) 10Elukey: [C: 032] confluent::kafka: refactor existing code with a commons class [puppet] - 10https://gerrit.wikimedia.org/r/384602 (https://phabricator.wikimedia.org/T177216) (owner: 10Ottomata) [09:54:11] (03PS6) 10Elukey: confluent::kafka: refactor existing code with a commons class [puppet] - 10https://gerrit.wikimedia.org/r/384602 (https://phabricator.wikimedia.org/T177216) (owner: 10Ottomata) [09:54:21] 171018 9:42:42 [ERROR] InnoDB: File (unknown): 'read' returned OS error 105. Cannot continue operation [09:54:30] 171018 09:43:11 mysqld_safe Number of processes running now: 0 [09:54:54] 10Operations, 10ops-eqdfw, 10DBA: db1082 crashed - https://phabricator.wikimedia.org/T178460#3693095 (10Marostegui) p:05Triage>03High This is the second time this server has a storage crash: T158188 @Cmjohnson can we get a new RAID controller for this host? It has happened twice already. [09:55:01] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] [09:55:16] 10Operations, 10ops-eqiad, 10DBA: db1082 crashed - https://phabricator.wikimedia.org/T178460#3693131 (10Marostegui) [09:55:20] db1044 is lagging behind, is it pooled? 
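The "clearly block device error" diagnosis above follows directly from the InnoDB log lines quoted: "Operating system error number 5" is the Linux `errno` for `EIO`, which the kernel returns when the underlying storage fails a read. A quick interpreter check (Linux `errno` values assumed) confirms the mapping:

```python
import errno
import os

# InnoDB reported "Operating system error number 5 in a file operation";
# on Linux, errno 5 is EIO, the generic low-level I/O failure, which is
# why the log then says "Error number 5 means 'Input/output error'".
assert errno.errorcode[5] == "EIO"
print(os.strerror(errno.EIO))
```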
[09:55:39] db1044 has no lag [09:55:44] Seconds_Behind_Master: 0 [09:55:51] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:55:54] "Server db1044 has 5.4986140727997 seconds of lag (>= 4.5989730358124)" [09:55:56] on logs [09:56:13] ah, maybe it is spiking some seconds (it has 0 weight) because of mydumper [09:58:01] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [09:59:31] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1947 bytes in 0.096 second response time [10:00:11] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.525 second response time [10:01:39] 10Operations, 10ops-eqiad, 10DBA: db1082 storage crashed - https://phabricator.wikimedia.org/T178460#3693168 (10Marostegui) [10:04:16] I will start the slave on db1082, not worth keeping it stopped if the server is up [10:04:49] yeah [10:04:50] agreed [10:05:13] (03PS1) 10Addshore: ci: jenkins, allow access to computer/.*/builds [puppet] - 10https://gerrit.wikimedia.org/r/384960 (https://phabricator.wikimedia.org/T178458) [10:06:43] 10Operations, 10Discovery, 10Recommendation-API, 10Wikidata, and 3 others: flapping monitoring for recommendation_api on scb - https://phabricator.wikimedia.org/T178445#3693178 (10mobrovac) The amount of requests from the Recommendation API service actually makes sense. On each service checker script run,... [10:20:54] (03CR) 10Elukey: "I am trying to figure out where this use case fits into the https://wikitech.wikimedia.org/wiki/Puppet_coding guideline. 
It is technically" [puppet] - 10https://gerrit.wikimedia.org/r/384608 (owner: 10Ottomata) [10:21:05] !log imported linux 4.9.30-2+deb9u5~bpo8+1 for jessie-wikimedia [10:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:57] 10Operations, 10DBA, 10Patch-For-Review: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789#3693217 (10jcrespo) To try to solve the previous issue, the following grants have been executed: ``` root@dbstore2001> set sql_log_bin=0; GRANT SELECT,... [10:26:27] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10User-Elukey: rack and setup db1107 and db1108 - https://phabricator.wikimedia.org/T177405#3693218 (10elukey) p:05Triage>03Normal [10:32:12] (03PS7) 10Giuseppe Lavagetto: Port docker builder [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/384081 (https://phabricator.wikimedia.org/T177276) [10:34:46] (03PS1) 10Elukey: Set PXE boot options and notification disabled for db110[78] [puppet] - 10https://gerrit.wikimedia.org/r/384963 (https://phabricator.wikimedia.org/T177405) [10:50:52] !log rebooting multatuli for kernel update [10:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:23] !log mobrovac@tin Started deploy [restbase/deploy@15619e0]: Introduce the page/summary proxy (no-op for now) [10:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:23] PROBLEM - Restbase root url on restbase1007 is CRITICAL: connect to address 10.64.0.223 and port 7231: Connection refused [11:00:42] known ^ [11:00:46] (03CR) 10Jcrespo: [C: 031] Set PXE boot options and notification disabled for db110[78] [puppet] - 10https://gerrit.wikimedia.org/r/384963 (https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [11:02:21] RECOVERY - Restbase root url on restbase1007 is OK: HTTP OK: HTTP/1.1 200 - 15742 bytes in 0.025 second response time [11:02:46] !log mobrovac@tin Finished 
deploy [restbase/deploy@15619e0]: Introduce the page/summary proxy (no-op for now) (duration: 05m 23s) [11:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:01] !log mobrovac@tin Started deploy [restbase/deploy@15619e0]: Introduce the page/summary proxy (no-op for now), part #2 [11:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:31] (03CR) 10Elukey: [C: 032] Set PXE boot options and notification disabled for db110[78] [puppet] - 10https://gerrit.wikimedia.org/r/384963 (https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [11:11:54] !log mobrovac@tin Finished deploy [restbase/deploy@15619e0]: Introduce the page/summary proxy (no-op for now), part #2 (duration: 08m 53s) [11:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:34] (03PS1) 10Jcrespo: maridb: Add db1105 to help db1071 because db1082 has crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384970 (https://phabricator.wikimedia.org/T178460) [11:18:07] (03CR) 10Marostegui: [C: 031] maridb: Add db1105 to help db1071 because db1082 has crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384970 (https://phabricator.wikimedia.org/T178460) (owner: 10Jcrespo) [11:18:39] (03CR) 10Jcrespo: [C: 032] maridb: Add db1105 to help db1071 because db1082 has crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384970 (https://phabricator.wikimedia.org/T178460) (owner: 10Jcrespo) [11:19:53] (03Merged) 10jenkins-bot: maridb: Add db1105 to help db1071 because db1082 has crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384970 (https://phabricator.wikimedia.org/T178460) (owner: 10Jcrespo) [11:20:05] (03CR) 10jenkins-bot: maridb: Add db1105 to help db1071 because db1082 has crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384970 (https://phabricator.wikimedia.org/T178460) (owner: 10Jcrespo) [11:21:04] !log upgrading mw1261 to wikidiff2 1.5.1 [11:21:10] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:46] !log Optimize pagelinks and templatelinks on db1102 s7 - T174509 [11:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:54] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [11:31:49] !log upgrading mw1262-mw1265 (canaries) to wikidiff2 1.5.1 [11:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:01] (03PS1) 10Marostegui: s4.hosts: Add db2084 to s4 [software] - 10https://gerrit.wikimedia.org/r/384971 (https://phabricator.wikimedia.org/T178359) [11:39:25] (03CR) 10Marostegui: [C: 032] s4.hosts: Add db2084 to s4 [software] - 10https://gerrit.wikimedia.org/r/384971 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [11:40:11] (03Merged) 10jenkins-bot: s4.hosts: Add db2084 to s4 [software] - 10https://gerrit.wikimedia.org/r/384971 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [11:45:34] !log mobrovac@tin Started deploy [restbase/deploy@99052c1]: Use Cass3 storage for ruwiki summaries [11:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:37] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Add db1105 to help db1071 because db1082 has crashed (duration: 00m 50s) [11:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:33] !log mobrovac@tin Finished deploy [restbase/deploy@99052c1]: Use Cass3 storage for ruwiki summaries (duration: 08m 59s) [11:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:14] high errors on mediawikiwiki, strange [11:55:53] do we have an "errors per mw version" view¿ [11:57:43] mw1265 errors, seem to be due to the upgrade [11:59:36] PROBLEM - Nginx local proxy to apache on mw1206 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.008 second response 
time [11:59:53] I think something is wrong with TTMServerMessageUpdateJob [11:59:56] PROBLEM - HHVM rendering on mw1206 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.624 second response time [12:00:09] since 11:04 [12:00:37] RECOVERY - Nginx local proxy to apache on mw1206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.143 second response time [12:00:54] mobrovac: something it could be deployed? [12:00:56] RECOVERY - HHVM rendering on mw1206 is OK: HTTP OK: HTTP/1.1 200 OK - 74332 bytes in 0.416 second response time [12:01:42] jynus: the hhvm failures? no, not likely [12:01:44] no, it does not start then [12:03:19] It is an ongoing issue with ttmserver jobs, I think [12:04:15] !log mobrovac@tin Started deploy [restbase/deploy@3c7abf6]: Bug fix: Remove the space in the summary table name [12:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:40] hhvm dumped core on mw1206, having a look [12:09:50] that's unrelated to the upgrade on mw1261-mw1265 [12:11:12] there's a number of fatal errors caused by a stack overflow in remex-html [12:11:28] !log mobrovac@tin Finished deploy [restbase/deploy@3c7abf6]: Bug fix: Remove the space in the summary table name (duration: 07m 13s) [12:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:49] 10Operations, 10Beta-Cluster-Infrastructure, 10Thumbor, 10Release-Engineering-Team (Kanban): nutcracker fails to start due to lack of /var/run/nutcracker (ex: deployment-videoscaler01 has memcached failures) - https://phabricator.wikimedia.org/T178457#3692957 (10faidon) nutcracker ships `/usr/lib/tmpfiles.... 
[12:20:10] 10Operations, 10Beta-Cluster-Infrastructure, 10Thumbor, 10Release-Engineering-Team (Kanban): nutcracker fails to start due to lack of /var/run/nutcracker (ex: deployment-videoscaler01 has memcached failures) - https://phabricator.wikimedia.org/T178457#3692957 (10MoritzMuehlenhoff) deployment-videoscaler01... [12:28:33] 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team (Kanban): nutcracker fails to start due to lack of /var/run/nutcracker (ex: deployment-videoscaler01 has memcached failures) - https://phabricator.wikimedia.org/T178457#3693635 (10Gilles) [12:28:45] 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team (Kanban): nutcracker fails to start due to lack of /var/run/nutcracker (ex: deployment-videoscaler01 has memcached failures) - https://phabricator.wikimedia.org/T178457#3692957 (10Gilles) Videoscalers don't run Thumbor [12:29:25] (03PS1) 10BBlack: browsersec: exclude static/images so the errorpage logo can fetch [puppet] - 10https://gerrit.wikimedia.org/r/384976 [12:30:10] (03CR) 10BBlack: [C: 032] browsersec: exclude static/images so the errorpage logo can fetch [puppet] - 10https://gerrit.wikimedia.org/r/384976 (owner: 10BBlack) [12:30:29] (03PS3) 10BBlack: errorpage: Migrate from back-compat wmf.png to wmf-logo.png [puppet] - 10https://gerrit.wikimedia.org/r/381274 (owner: 10Krinkle) [12:30:38] (03CR) 10BBlack: [V: 032 C: 032] errorpage: Migrate from back-compat wmf.png to wmf-logo.png [puppet] - 10https://gerrit.wikimedia.org/r/381274 (owner: 10Krinkle) [12:30:56] (03PS3) 10BBlack: errorpage: Set explicit height on logo image [puppet] - 10https://gerrit.wikimedia.org/r/381275 (owner: 10Krinkle) [12:31:02] (03CR) 10BBlack: [V: 032 C: 032] errorpage: Set explicit height on logo image [puppet] - 10https://gerrit.wikimedia.org/r/381275 (owner: 10Krinkle) [12:34:18] RECOVERY - High lag on wdqs1003 is OK: OK: Less than 30.00% above the threshold [600.0] [12:35:34] (03PS3) 10BBlack: cacheproxy: IPv4 PMTUD 
when blackhole detected [puppet] - 10https://gerrit.wikimedia.org/r/384526 [12:50:02] (03CR) 10Muehlenhoff: Synchronise jenkins package to thirdparty/ci (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/384039 (https://phabricator.wikimedia.org/T158583) (owner: 10Muehlenhoff) [12:51:37] (03PS2) 10Muehlenhoff: Synchronise jenkins package to thirdparty/ci [puppet] - 10https://gerrit.wikimedia.org/r/384039 (https://phabricator.wikimedia.org/T158583) [12:55:03] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: rack and setup db1107 and db1108 - https://phabricator.wikimedia.org/T177405#3657517 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['db1107.eqiad.wmnet', 'db11... [12:59:07] (03PS2) 10Zfilipin: pagePreviews: Restart A/B test on enwiki and dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383378 (https://phabricator.wikimedia.org/T176469) (owner: 10Jdlrobson) [12:59:09] 10Operations, 10Discovery, 10Recommendation-API, 10Wikidata, and 3 others: flapping monitoring for recommendation_api on scb - https://phabricator.wikimedia.org/T178445#3693697 (10Gehel) I'm not sure why we have a retry in the first place. An exponential back-off would be good, or to honor the "Retry-After... [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171018T1300). [13:00:04] dcausse and phuedx: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
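The exponential back-off that Gehel suggests on T178445 above is a standard pattern: double the wait after each failed attempt, cap it, and add jitter so that many checker hosts do not retry in lockstep. A minimal sketch (function names and parameter values are illustrative, not from any WMF codebase):

```python
import random
import time

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential back-off with full jitter: a random delay in
    [0, min(cap, base * 2**attempt)] seconds. In an HTTP client this
    lower bound would also honor a Retry-After header if one is sent."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def call_with_retries(fn, max_attempts=5):
    """Call fn(), retrying on any exception with growing, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            time.sleep(backoff_delay(attempt))
```

With this in place, a flood of failing checks spreads out instead of compounding the load, which addresses the "retries probably make it worse" concern raised earlier in the log.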
[13:00:11] I can SWAT today [13:00:14] o/ [13:00:15] o/ [13:01:14] dcausse, phuedx: do you expect your commit to take a long time to deploy/test? [13:01:25] zeljkof: nope [13:01:28] zeljkof: no [13:02:16] ok, in that case starting in the calendar order, dcausse first, phuedx second [13:02:30] dcausse: will ping you when the commit is at mwdebug [13:02:33] ok [13:03:57] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383378 (https://phabricator.wikimedia.org/T176469) (owner: 10Jdlrobson) [13:05:04] (03Merged) 10jenkins-bot: pagePreviews: Restart A/B test on enwiki and dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383378 (https://phabricator.wikimedia.org/T176469) (owner: 10Jdlrobson) [13:05:49] !log uploaded wikidiff2 1.5.1 to apt.wikimedia.org [13:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:13] (03CR) 10jenkins-bot: pagePreviews: Restart A/B test on enwiki and dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383378 (https://phabricator.wikimedia.org/T176469) (owner: 10Jdlrobson) [13:07:09] phuedx: your commit is merged, will ping you in a minute when it's at mwdebug1002 [13:07:15] thanks zeljkof [13:08:14] phuedx: it's at mwdebug1002 [13:08:19] ta [13:09:47] dcausse: the commit is merged, will be at mwdebug in a minute or two [13:09:54] zeljkof: sure [13:11:10] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Make WDQS throttling more aggressive - https://phabricator.wikimedia.org/T178491#3693708 (10Gehel) [13:11:18] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: rack and setup db1107 and db1108 - https://phabricator.wikimedia.org/T177405#3693722 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1108.eqiad.wmnet', 'db1107.eqiad.wmnet'] ``` and were **ALL** successful. 
[13:11:28] \o/ [13:12:44] zeljkof: lgtm [13:13:20] phuedx: ok, deploying [13:13:29] dcausse: it's at mwdebug1002 [13:13:35] zeljkof: ok testing [13:14:11] zeljkof: cool, i'll be watching this graph: https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?orgId=1&var-schema=Popups&from=now-30m&to=now [13:14:24] zeljkof: task number for the deploy subject: T176469 [13:14:24] T176469: Relaunch page previews a/b test on en and de wiki - https://phabricator.wikimedia.org/T176469 [13:15:14] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:383378|pagePreviews: Restart A/B test on enwiki and dewiki (T176469)]] (duration: 00m 51s) [13:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:23] phuedx: deployed, please check [13:18:30] zeljkof: confirmed that the change is visible in the browser (the config values on enwiki and dewiki are as expected) [13:19:00] phuedx: great! thanks for deploying with #releng ;) [13:19:01] and seeing events appearing in the pipeline (per the graph) [13:20:25] dcausse: just checking, do you need more time to test? [13:20:29] 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team (Kanban): nutcracker fails to start due to lack of /var/run/nutcracker (ex: deployment-videoscaler01 has memcached failures) - https://phabricator.wikimedia.org/T178457#3693745 (10faidon) Ah! Yes, that all makes sense now, thanks! We ha... 
[13:20:35] zeljkof: yes [13:20:45] ok, no rush, just checking [13:21:00] !log repooling wdqs1003 now that it has caught up on updates [13:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:48] zeljkof: looks good [13:21:56] dcausse: ok, deploying [13:22:58] !log zfilipin@tin Synchronized php-1.31.0-wmf.3/extensions/WikimediaEvents/modules/ext.wikimediaEvents.searchSatisfaction.js: SWAT: [[gerrit:384669|[cirrus] Turn on recall A/B test on enwiki (T177502)]] (duration: 00m 50s) [13:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:06] T177502: Deploy A/B test to test relaxing the retrieval query filter - https://phabricator.wikimedia.org/T177502 [13:23:11] dcausse: deployed, please check [13:23:35] zeljkof: perfect, thanks! [13:23:47] dcausse: thanks for releasing with #releng ;) [13:28:07] !log EU SWAT finished! [13:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:53] (03PS1) 10Elukey: netboot: prevent db110[78] to be reimaged [puppet] - 10https://gerrit.wikimedia.org/r/384979 (https://phabricator.wikimedia.org/T177405) [13:38:27] (03CR) 10Herron: "So there is an rsyncd already running on the puppetmasters serving /var/lib/puppet/server/ssl/ca and /var/lib/puppet/volatile. This would " [puppet] - 10https://gerrit.wikimedia.org/r/384834 (https://phabricator.wikimedia.org/T177843) (owner: 10Herron) [13:41:03] 10Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3693836 (10elukey) Updating this task in light of the recent discussions. The analytics and DBA teams have been fighting a lot with disk space consumption on dbstore1002 due t...
[13:46:31] (03PS2) 10Elukey: netboot: prevent db110[78] to be reimaged [puppet] - 10https://gerrit.wikimedia.org/r/384979 (https://phabricator.wikimedia.org/T177405) [13:52:39] (03PS1) 10Muehlenhoff: Create /run/nutcracker on stretch onwards [puppet] - 10https://gerrit.wikimedia.org/r/384980 (https://phabricator.wikimedia.org/T178457) [13:53:11] (03CR) 10jerkins-bot: [V: 04-1] Create /run/nutcracker on stretch onwards [puppet] - 10https://gerrit.wikimedia.org/r/384980 (https://phabricator.wikimedia.org/T178457) (owner: 10Muehlenhoff) [13:53:20] (03PS2) 10Muehlenhoff: Create /run/nutcracker on stretch onwards [puppet] - 10https://gerrit.wikimedia.org/r/384980 (https://phabricator.wikimedia.org/T178457) [13:53:50] (03CR) 10jerkins-bot: [V: 04-1] Create /run/nutcracker on stretch onwards [puppet] - 10https://gerrit.wikimedia.org/r/384980 (https://phabricator.wikimedia.org/T178457) (owner: 10Muehlenhoff) [13:58:13] (03PS3) 10Muehlenhoff: Create /run/nutcracker on stretch onwards [puppet] - 10https://gerrit.wikimedia.org/r/384980 (https://phabricator.wikimedia.org/T178457) [14:03:14] !log create reading-lists cloud project [14:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:36] !log bump up quota on cyberpower cloud project per T178332 [14:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:45] T178332: Request increased quota for cyberbot Cloud VPS project - https://phabricator.wikimedia.org/T178332 [14:03:55] !log add static ip to cloud mwstake project per T178012 [14:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:05] T178012: Request increased quota for mwstake Cloud VPS project - https://phabricator.wikimedia.org/T178012 [14:04:12] someone is executing SELECT SLEEP(@time) on terbium ... 
[14:05:33] (03Abandoned) 10Herron: puppetmaster: add yaml fact directory to rsyncd on frontends [puppet] - 10https://gerrit.wikimedia.org/r/384834 (https://phabricator.wikimedia.org/T177843) (owner: 10Herron) [14:12:24] 10Operations, 10Ops-Access-Requests, 10Research, 10Patch-For-Review: Request public key change for a research fellow - https://phabricator.wikimedia.org/T177889#3693993 (10RobH) The details on a working ssh config are listed here: https://wikitech.wikimedia.org/wiki/Production_shell_access#Standard_config... [14:16:04] (03PS1) 10BBlack: browsersec: use 302->cacheable-200 pattern instead of 403 [puppet] - 10https://gerrit.wikimedia.org/r/384982 [14:17:48] (03PS4) 10Muehlenhoff: Create /run/nutcracker on stretch onwards [puppet] - 10https://gerrit.wikimedia.org/r/384980 (https://phabricator.wikimedia.org/T178457) [14:23:43] !log uploaded hhvm 3.18.5+dfsg-1+wmf1+deb9u1 to apt.wikimedia.org/stretch-wikimedia [14:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:56] 10Operations, 10Collaboration-Team-Triage, 10Notifications, 10Anti-Harassment (AHT Sprint 7), 10Patch-For-Review: Recover Echo Notification Blacklist from Backup - https://phabricator.wikimedia.org/T178313#3694021 (10dbarratt) >>! In T178313#3693001, @jcrespo wrote: > As I commented above, I think by def... [14:32:11] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1092 - https://phabricator.wikimedia.org/T177264#3694025 (10Cmjohnson) The disk has been replaced [14:32:21] (03CR) 10Ema: [C: 031] browsersec: use 302->cacheable-200 pattern instead of 403 [puppet] - 10https://gerrit.wikimedia.org/r/384982 (owner: 10BBlack) [14:39:36] 10Operations, 10ops-eqiad, 10DBA: db1101 crashed - memory errors - https://phabricator.wikimedia.org/T178383#3694030 (10Cmjohnson) A new DIMM has been requested with Dell. You have successfully submitted request SR955416674. 
[14:43:31] (03CR) 10BBlack: [C: 032] browsersec: use 302->cacheable-200 pattern instead of 403 [puppet] - 10https://gerrit.wikimedia.org/r/384982 (owner: 10BBlack) [14:48:16] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1082 storage crashed - https://phabricator.wikimedia.org/T178460#3694047 (10Cmjohnson) A case with HPE has been submitted Your case was successfully submitted. Please note your Case ID: 5323881381 for future reference. [14:50:29] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1092 - https://phabricator.wikimedia.org/T177264#3652772 (10jcrespo) ``` physicaldrive 1I:1:8 (port 1I:box 1:bay 8, Solid State SATA, 800 GB, Rebuilding) ``` [14:56:38] 10Operations, 10ops-eqiad, 10DC-Ops: Multiple servers in eqiad D8 showing PSU failures - https://phabricator.wikimedia.org/T177227#3694097 (10Cmjohnson) @herron The server is out of warranty but I took a PSU from a decom server and replaced psu1 on analytics1037 and I no longer see the error. Take a look an... [14:56:54] 10Operations, 10Release Pipeline, 10monitoring, 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban): Icinga disk space alert when a Docker container is running on an host - https://phabricator.wikimedia.org/T178454#3694099 (10thcipriani) [15:03:02] 10Operations, 10Cloud-Services, 10VPS-project-Phabricator, 10Patch-For-Review: spam from phabricator in labs - https://phabricator.wikimedia.org/T166322#3694113 (10Dzahn) exim-ganglia stats can/should be removed from everything, which will also close this. also, i haven't received any of this in a long ti... 
[15:03:11] 10Operations, 10Cloud-Services, 10VPS-project-Phabricator, 10Patch-For-Review: spam from phabricator in labs - https://phabricator.wikimedia.org/T166322#3694114 (10Dzahn) p:05High>03Normal [15:03:40] 10Operations, 10Cloud-Services, 10VPS-project-Phabricator, 10Patch-For-Review: spam from phabricator in labs - https://phabricator.wikimedia.org/T166322#3292437 (10Dzahn) a:03Dzahn [15:08:11] PROBLEM - HHVM jobrunner on mw1300 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [15:08:13] 10Operations, 10Cloud-Services, 10VPS-project-Phabricator, 10Patch-For-Review: spam from phabricator in labs - https://phabricator.wikimedia.org/T166322#3694164 (10Paladox) @Dzahn yep, I’ve applied them manually on the puppet master and there somewhere in the git log. [15:08:21] PROBLEM - Host labvirt1015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:09:11] PROBLEM - Nginx local proxy to apache on mw1300 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.005 second response time [15:10:08] 10Operations, 10Ops-Access-Requests: Requesting access to analytics servers for cwdent - https://phabricator.wikimedia.org/T178406#3694172 (10RobH) a:03cwdent @cwdent: This makes sense to me! I'm on clinic duty this week, so I'll be assisting you in getting your access reinstated. https://wikitech.wikimed... 
[15:10:11] RECOVERY - HHVM jobrunner on mw1300 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [15:10:12] RECOVERY - Nginx local proxy to apache on mw1300 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.006 second response time [15:11:57] 10Operations, 10ops-eqiad, 10hardware-requests: Decommission ocg1001-3 - https://phabricator.wikimedia.org/T177958#3694181 (10Dzahn) [15:13:32] RECOVERY - Host labvirt1015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms [15:14:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3694183 (10Cmjohnson) @bd808 I swapped the CPU's to see if the error follows the CPU. The replacement that I put in there was refurbished so there is a possibility it was ba... [15:14:42] 10Operations, 10ops-eqiad, 10DBA: db1101 crashed - memory errors - https://phabricator.wikimedia.org/T178383#3694184 (10Marostegui) Great!! Thanks! [15:15:24] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1082 storage crashed - https://phabricator.wikimedia.org/T178460#3694186 (10Marostegui) Thank you! [15:15:26] 10Operations: improve cron spam visibility - https://phabricator.wikimedia.org/T84845#3694187 (10Dzahn) @volans This seems to be what you mentioned in last monitoring meeting when you suggested an Icinga alert for cronspam. [15:17:09] (03PS1) 10RobH: cwdent to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/384991 (https://phabricator.wikimedia.org/T178406) [15:18:25] (03PS2) 10RobH: cwdent to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/384991 (https://phabricator.wikimedia.org/T178406) [15:18:49] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to analytics servers for cwdent - https://phabricator.wikimedia.org/T178406#3694199 (10cwdent) @RobH yes that is the correct group, thanks! 
I only used the login a handful of times before but I am pretty sure it was to stat100[45] [15:19:18] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to analytics servers for cwdent - https://phabricator.wikimedia.org/T178406#3694201 (10RobH) 05Open>03stalled a:05cwdent>03RobH I've prepared the patchset above. If no objections are noted, the 3 day wait for group additions... [15:19:56] 10Operations, 10monitoring: Cron spam: figure out a way it doesn't get ignored - https://phabricator.wikimedia.org/T178311#3694206 (10Volans) Closing in favour of T84845 [15:20:08] 10Operations, 10monitoring: Cron spam: figure out a way it doesn't get ignored - https://phabricator.wikimedia.org/T178311#3694209 (10Volans) [15:20:10] 10Operations: improve cron spam visibility - https://phabricator.wikimedia.org/T84845#931865 (10Volans) [15:20:18] 10Operations, 10monitoring: improve cron spam visibility - https://phabricator.wikimedia.org/T84845#931865 (10Volans) [15:20:54] 10Operations, 10monitoring: improve cron spam visibility - https://phabricator.wikimedia.org/T84845#931865 (10Volans) @Dzahn thanks for pointing this out, I've merged in as duplicate the other task I had opened. 
[15:28:32] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: Set up octocatalog-diff on host with access to puppetmasters and puppetdb - https://phabricator.wikimedia.org/T177843#3694233 (10herron) [15:28:50] (03PS1) 10Giuseppe Lavagetto: Add the repository to the name of all generated containers [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/384992 [15:32:49] (03PS1) 10Herron: puppet: temporarily allow puppetcompiler1001 to fetch all catalogs [puppet] - 10https://gerrit.wikimedia.org/r/384993 (https://phabricator.wikimedia.org/T177843) [15:33:19] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: Set up octocatalog-diff on host with access to puppetmasters and puppetdb - https://phabricator.wikimedia.org/T177843#3694240 (10herron) [15:34:43] 10Operations, 10Puppet, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3694242 (10herron) [15:35:01] PROBLEM - configured eth on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:35:02] PROBLEM - Check size of conntrack table on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:35:11] PROBLEM - Disk space on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:35:26] somebody is probably hammering stat1006 [15:35:32] PROBLEM - DPKG on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:35:32] PROBLEM - MD RAID on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:35:42] PROBLEM - Check whether ferm is active by checking the default input chain on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:35:51] PROBLEM - dhclient process on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:35:53] PROBLEM - Check systemd state on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:35:53] PROBLEM - puppet last run on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:38:21] PROBLEM - Check the NTP synchronisation status of timesyncd on 
stat1006 is CRITICAL: Return code of 255 is out of bounds [15:39:06] (03CR) 10Volans: "Much nicer, thanks for the fixes" (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/384081 (https://phabricator.wikimedia.org/T177276) (owner: 10Giuseppe Lavagetto) [15:41:14] 10Operations, 10DBA, 10Patch-For-Review: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789#3694264 (10jcrespo) Latest run does flow and others too, correctly: ``` root@dbstore2001:/srv/backups/x1.20171017220041$ ls -la | grep flowdb -rw-r--r--... [15:41:22] 10Operations, 10ops-ulsfo, 10netops: connect new office link to asw-ulsfo - https://phabricator.wikimedia.org/T176350#3694265 (10RobH) 05stalled>03Resolved I neglected to resolve this, but it was handled within a day or so of our onsite work being completed. [15:41:51] RECOVERY - dhclient process on stat1006 is OK: PROCS OK: 0 processes with command name dhclient [15:41:51] RECOVERY - Check whether ferm is active by checking the default input chain on stat1006 is OK: OK ferm input default policy is set [15:41:51] RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational [15:42:01] RECOVERY - configured eth on stat1006 is OK: OK - interfaces up [15:42:10] RECOVERY - Check size of conntrack table on stat1006 is OK: OK: nf_conntrack is 0 % full [15:42:11] RECOVERY - Disk space on stat1006 is OK: DISK OK [15:42:37] oomkiller did all the work [15:43:33] (03PS5) 10Jcrespo: proxysql: Setup proxysql on terbium/wasat as a test [puppet] - 10https://gerrit.wikimedia.org/r/384695 (https://phabricator.wikimedia.org/T175672) [15:43:46] elukey: which process got killed? 
[15:45:51] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:48:00] RECOVERY - HP RAID on db1092 is OK: OK: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery/Capacitor: OK [15:48:31] (03PS6) 10Jcrespo: proxysql: Setup proxysql on terbium/wasat as a test [puppet] - 10https://gerrit.wikimedia.org/r/384695 (https://phabricator.wikimedia.org/T175672) [15:48:50] moritzm: a python process that was probably a data crunching one [15:50:07] k [15:50:43] (03PS7) 10Jcrespo: proxysql: Setup proxysql on terbium/wasat as a test [puppet] - 10https://gerrit.wikimedia.org/r/384695 (https://phabricator.wikimedia.org/T175672) [15:52:11] RECOVERY - DPKG on stat1006 is OK: All packages OK [15:53:29] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1092 - https://phabricator.wikimedia.org/T177264#3694315 (10jcrespo) 05Open>03Resolved a:03jcrespo ``` physicaldrive 1I:1:8 (port 1I:box 1:bay 8, Solid State SATA, 800 GB, OK) RECOVERY - HP RAID on db1092 is OK: OK: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:... 
[15:54:00] RECOVERY - MD RAID on stat1006 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [15:55:55] (03PS8) 10Giuseppe Lavagetto: Port docker builder [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/384081 (https://phabricator.wikimedia.org/T177276) [15:55:57] (03PS2) 10Giuseppe Lavagetto: Add the repository to the name of all generated containers [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/384992 [15:56:11] PROBLEM - DPKG on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:56:11] PROBLEM - configured eth on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:56:21] PROBLEM - Check size of conntrack table on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:56:22] grrr [15:56:25] checking again [15:56:30] PROBLEM - Disk space on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:57:00] PROBLEM - MD RAID on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:57:00] PROBLEM - Check whether ferm is active by checking the default input chain on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:57:01] PROBLEM - dhclient process on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:57:01] PROBLEM - Check systemd state on stat1006 is CRITICAL: Return code of 255 is out of bounds [15:57:51] PROBLEM - puppet last run on stat1006 is CRITICAL: Return code of 255 is out of bounds [16:01:06] (talking with the owner of the script in analytics) [16:02:42] (03PS2) 10Herron: puppet: temporarily allow puppetcompiler1001 to fetch all catalogs [puppet] - 10https://gerrit.wikimedia.org/r/384993 (https://phabricator.wikimedia.org/T177843) [16:04:19] (03CR) 10Giuseppe Lavagetto: [C: 031] puppet: temporarily allow puppetcompiler1001 to fetch all catalogs [puppet] - 10https://gerrit.wikimedia.org/r/384993 (https://phabricator.wikimedia.org/T177843) (owner: 10Herron) [16:05:14] (03CR) 10Herron: [C: 032] puppet: temporarily allow puppetcompiler1001 to fetch all catalogs 
[puppet] - 10https://gerrit.wikimedia.org/r/384993 (https://phabricator.wikimedia.org/T177843) (owner: 10Herron) [16:05:37] (03PS1) 10RobH: install params for new cp40(29|3[012]).ulsfo.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/384997 (https://phabricator.wikimedia.org/T178423) [16:06:30] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/compiler02/8368/terbium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/384695 (https://phabricator.wikimedia.org/T175672) (owner: 10Jcrespo) [16:07:12] (03CR) 10RobH: [C: 032] install params for new cp40(29|3[012]).ulsfo.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/384997 (https://phabricator.wikimedia.org/T178423) (owner: 10RobH) [16:07:20] (03PS2) 10RobH: install params for new cp40(29|3[012]).ulsfo.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/384997 (https://phabricator.wikimedia.org/T178423) [16:10:33] (03PS1) 10Alexandros Kosiaris: Drop i386 environments from package_builder [puppet] - 10https://gerrit.wikimedia.org/r/384999 [16:10:35] (03PS1) 10Alexandros Kosiaris: package_builder: Change all docs to stretch [puppet] - 10https://gerrit.wikimedia.org/r/385000 [16:10:37] (03PS1) 10Alexandros Kosiaris: package_builder: Switch default distribution to stretch [puppet] - 10https://gerrit.wikimedia.org/r/385001 [16:10:39] (03PS1) 10Alexandros Kosiaris: package_builder: Add buster as an environment [puppet] - 10https://gerrit.wikimedia.org/r/385002 [16:12:03] RECOVERY - dhclient process on stat1006 is OK: PROCS OK: 0 processes with command name dhclient [16:12:03] RECOVERY - Check whether ferm is active by checking the default input chain on stat1006 is OK: OK ferm input default policy is set [16:12:03] RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational [16:12:13] RECOVERY - DPKG on stat1006 is OK: All packages OK [16:12:13] RECOVERY - configured eth on stat1006 is OK: OK - interfaces up [16:12:33] RECOVERY - Disk space on stat1006 is OK: DISK OK [16:12:53] 
RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [16:13:03] RECOVERY - MD RAID on stat1006 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [16:13:23] RECOVERY - Check size of conntrack table on stat1006 is OK: OK: nf_conntrack is 0 % full [16:19:05] 10Operations, 10ops-ulsfo, 10Traffic: rack/setup/install cp40(29|3[012]).ulsfo.wmnet - https://phabricator.wikimedia.org/T178423#3694375 (10RobH) [16:21:22] (03CR) 10Paladox: [C: 031] "Passes" [puppet] - 10https://gerrit.wikimedia.org/r/384760 (owner: 10Chad) [16:24:35] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 3 others: Make WDQS throttling more aggressive - https://phabricator.wikimedia.org/T178491#3694379 (10debt) p:05Triage>03High [16:25:02] PROBLEM - puppet last run on mw2127 is CRITICAL: CRITICAL: Puppet has 18 failures. Last run 2 minutes ago with 18 failures. Failed resources (up to 3 shown) [16:25:03] PROBLEM - puppet last run on mw1284 is CRITICAL: CRITICAL: Puppet has 38 failures. Last run 2 minutes ago with 38 failures. Failed resources (up to 3 shown): File[/home/robh],File[/home/thcipriani],File[/home/jgreen],File[/home/dduvall] [16:25:16] .... [16:25:28] herron: ^ could your change cause that? [16:25:32] PROBLEM - puppet last run on mw2188 is CRITICAL: CRITICAL: Puppet has 41 failures. Last run 2 minutes ago with 41 failures. Failed resources (up to 3 shown): File[/etc/fonts/conf.d/10-antialias.conf],File[/usr/local/bin/hhvmadm],File[/usr/local/sbin/hhvm-dump-debug],File[/usr/local/sbin/hhvm-collect-heaps] [16:25:49] or mine did... not sure [16:25:52] PROBLEM - puppet last run on wtp2013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:25:53] PROBLEM - puppet last run on mw2235 is CRITICAL: CRITICAL: Puppet has 36 failures. Last run 2 minutes ago with 36 failures. Failed resources (up to 3 shown) [16:25:59] im going to back mine out cuz its easy. 
[16:26:02] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:26:03] robh possibly [16:26:12] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 38 failures. Last run 3 minutes ago with 38 failures. Failed resources (up to 3 shown): File[/etc/apache2/conf-available/50-server-status.conf],File[/usr/local/bin/prometheus-puppet-agent-stats],File[/home/hashar],File[/home/yuvipanda] [16:26:14] herron: did you wanna revert yours? [16:26:21] ill leave mine alone since mine is less likely [16:26:22] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:26:27] I bounced apache on the puppetmasters to pick it up. curious if these will clear on next run [16:26:28] unless i have a bad regex in my site.pp entry. [16:26:33] PROBLEM - puppet last run on analytics1038 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/check_hadoop_yarn_node_state],File[/usr/lib/nagios/plugins/check_timedatectl] [16:26:52] PROBLEM - puppet last run on db1069 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:26:53] PROBLEM - puppet last run on mw2111 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:26:57] but i dont see how my change would do this [16:27:06] (03CR) 10jerkins-bot: [V: 04-1] cyberbot: start simplistic role classes for db/exec [puppet] - 10https://gerrit.wikimedia.org/r/385006 (owner: 10Dzahn) [17:16:24] (03PS4) 10Dzahn: cyberbot: start simplistic role classes for db/exec [puppet] - 10https://gerrit.wikimedia.org/r/385006 [17:16:47] (03CR) 10jerkins-bot: [V: 04-1] cyberbot: start simplistic role classes for db/exec [puppet] - 10https://gerrit.wikimedia.org/r/385006 (owner: 10Dzahn) [17:19:02] (03PS5) 10Dzahn: cyberbot: start simplistic role classes for db/exec [puppet] - 10https://gerrit.wikimedia.org/r/385006 [17:19:35] (03PS1) 10Ottomata: Move require_package('prometheus-jmx-exporter') to profile::prometheus::jmx_exporter [puppet] - 10https://gerrit.wikimedia.org/r/385008 [17:20:08] (03CR) 10jerkins-bot: [V: 04-1] Move require_package('prometheus-jmx-exporter') to profile::prometheus::jmx_exporter [puppet] - 10https://gerrit.wikimedia.org/r/385008 (owner: 10Ottomata) [17:20:16] (03CR) 10Dzahn: [C: 032] "for labs VPS project by Cyberpower" [puppet] - 10https://gerrit.wikimedia.org/r/385006 (owner: 10Dzahn) [17:21:05] (03PS2) 10Ottomata: Move require_package('prometheus-jmx-exporter') [puppet] - 10https://gerrit.wikimedia.org/r/385008 [17:22:01] (03CR) 10Ottomata: [C: 032] Move require_package('prometheus-jmx-exporter') [puppet] - 10https://gerrit.wikimedia.org/r/385008 (owner: 10Ottomata) [17:22:08] (03PS3) 10Ottomata: Move require_package('prometheus-jmx-exporter') [puppet] - 10https://gerrit.wikimedia.org/r/385008 [17:22:10] (03CR) 10Ottomata: [V: 032 C: 032] Move require_package('prometheus-jmx-exporter') [puppet] - 10https://gerrit.wikimedia.org/r/385008 (owner: 10Ottomata) [17:28:45] (03CR) 10Phuedx: [C: 031] Use correct name for gadgets popup on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385007 
(https://phabricator.wikimedia.org/T178438) (owner: 10Jdlrobson) [17:29:19] (03PS1) 10Ottomata: Own jmx exporter config files as root [puppet] - 10https://gerrit.wikimedia.org/r/385009 [17:32:29] (03PS2) 10Ottomata: Own jmx exporter config files as root and 0444 [puppet] - 10https://gerrit.wikimedia.org/r/385009 [17:32:54] (03CR) 10jerkins-bot: [V: 04-1] Own jmx exporter config files as root and 0444 [puppet] - 10https://gerrit.wikimedia.org/r/385009 (owner: 10Ottomata) [17:33:14] (03PS3) 10Ottomata: Own jmx exporter config files as root and 0444 [puppet] - 10https://gerrit.wikimedia.org/r/385009 [17:33:53] (03CR) 10Ottomata: [C: 032] Own jmx exporter config files as root and 0444 [puppet] - 10https://gerrit.wikimedia.org/r/385009 (owner: 10Ottomata) [17:44:08] ping legoktm [17:44:19] pong-ish [17:46:31] davidwbarratt: ? [17:47:16] legoktm https://phabricator.wikimedia.org/T177667#3680012 able to figure out why xdebug is missing? [17:48:27] no, I haven't had a chance to look into that [17:48:40] I think hashar recently changed how xdebug is installed in CI? [17:49:20] 10Operations, 10ops-ulsfo, 10Traffic: rack/setup/install cp40(29|3[012]).ulsfo.wmnet - https://phabricator.wikimedia.org/T178423#3694640 (10RobH) [17:51:20] idk my bff jill [17:54:57] 10Operations, 10ops-ulsfo, 10Traffic: rack/setup/install cp40(29|3[012]).ulsfo.wmnet - https://phabricator.wikimedia.org/T178423#3694663 (10RobH) cp4030 was ready to install, but now it won't let me connect to com2. I was connected, but my session timed out while I was on via com2, and now it won't let me ba... [17:56:24] 10Operations, 10ops-ulsfo, 10Traffic: rack/setup/install cp40(29|3[012]).ulsfo.wmnet - https://phabricator.wikimedia.org/T178423#3694668 (10RobH) [17:59:17] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3694673 (10bd808) >>! In T171473#3694183, @Cmjohnson wrote: > Let's monitor and see if the error persists. 
We probably need to find a way to put load on this system rather t... [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy Morning SWAT (Max 8 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171018T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:00:22] 10Operations, 10ops-ulsfo, 10Traffic: rack/setup/install cp40(29|3[012]).ulsfo.wmnet - https://phabricator.wikimedia.org/T178423#3694688 (10RobH) [18:17:19] PROBLEM - puppet last run on mw1216 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:19:28] (03PS3) 10Ottomata: Set up Kafka MirrorMaker from main -> jumbo in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/384586 (https://phabricator.wikimedia.org/T177216) [18:20:43] (03PS4) 10Ottomata: Set up Kafka MirrorMaker from main -> jumbo in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/384586 (https://phabricator.wikimedia.org/T177216) [18:21:10] (03CR) 10Ottomata: "This works in labs, except I'm unsure of how to test the prometheus jmx exporter part." 
[puppet] - 10https://gerrit.wikimedia.org/r/384586 (https://phabricator.wikimedia.org/T177216) (owner: 10Ottomata) [18:21:14] (03CR) 10jerkins-bot: [V: 04-1] Set up Kafka MirrorMaker from main -> jumbo in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/384586 (https://phabricator.wikimedia.org/T177216) (owner: 10Ottomata) [18:22:54] (03PS5) 10Ottomata: Set up Kafka MirrorMaker from main -> jumbo in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/384586 (https://phabricator.wikimedia.org/T177216) [18:23:12] (03CR) 10jerkins-bot: [V: 04-1] Set up Kafka MirrorMaker from main -> jumbo in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/384586 (https://phabricator.wikimedia.org/T177216) (owner: 10Ottomata) [18:26:21] (03PS8) 10BBlack: link against jemalloc and tune it a bit [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/384435 [18:26:23] (03PS10) 10BBlack: Release 0.1.0 [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/382873 [18:26:25] (03PS1) 10BBlack: Allow queue memory shrinking [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/385015 [18:26:45] (03PS6) 10Ottomata: Set up Kafka MirrorMaker from main -> jumbo in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/384586 (https://phabricator.wikimedia.org/T177216) [18:27:16] (03CR) 10jerkins-bot: [V: 04-1] Set up Kafka MirrorMaker from main -> jumbo in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/384586 (https://phabricator.wikimedia.org/T177216) (owner: 10Ottomata) [18:27:23] (03PS7) 10Ottomata: Set up Kafka MirrorMaker from main -> jumbo in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/384586 (https://phabricator.wikimedia.org/T177216) [18:41:11] 10Operations, 10ops-ulsfo, 10Traffic: rack/setup/install cp40(29|3[012]).ulsfo.wmnet - https://phabricator.wikimedia.org/T178423#3694773 (10RobH) [18:42:30] !log restarting rabbitmq-server on labcontrol1001; too many timeouts [18:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:12] 
(03PS3) 10Zoranzoki21: Sort dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383999 (owner: 10Hoo man) [18:47:26] RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:57:49] (03CR) 10BBlack: [V: 032 C: 032] Various minor improvements/updates [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/384165 (owner: 10BBlack) [18:57:54] (03CR) 10BBlack: [V: 032 C: 032] Remove multi-head support from strq, move into purger. [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/382865 (owner: 10BBlack) [18:57:55] !log restarting nova-network on labnet1001 [18:58:00] (03CR) 10BBlack: [V: 032 C: 032] Move all URL parsing and HTTP req generation to receiver [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/382867 (owner: 10BBlack) [18:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:05] (03CR) 10BBlack: [V: 032 C: 032] Chain the purgers together and split their stats [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/382868 (owner: 10BBlack) [18:58:15] (03CR) 10BBlack: [V: 032 C: 032] Bump http-parser upstream src to 2.7.1 + fixups [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/382870 (owner: 10BBlack) [18:58:16] (03CR) 10BBlack: [V: 032 C: 032] Refactor (rewrite?!) 
purging code [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/384167 (owner: 10BBlack) [18:58:19] (03CR) 10BBlack: [V: 032 C: 032] strq+purger: refactor, simplify, add queue delays [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/384433 (owner: 10BBlack) [18:58:24] (03CR) 10BBlack: [V: 032 C: 032] Rework stats further [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/384434 (owner: 10BBlack) [18:58:29] (03CR) 10BBlack: [V: 032 C: 032] Allow queue memory shrinking [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/385015 (owner: 10BBlack) [18:58:35] (03CR) 10BBlack: [V: 032 C: 032] link against jemalloc and tune it a bit [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/384435 (owner: 10BBlack) [18:58:41] (03CR) 10BBlack: [V: 032 C: 032] Release 0.1.0 [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/382873 (owner: 10BBlack) [19:00:05] no_justification: Your horoscope predicts another unfortunate MediaWiki train deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171018T1900). [19:00:05] No GERRIT patches in the queue for this window AFAICS. [19:00:39] (03CR) 10Paladox: [C: 031] Also use scap-deployed version of gerrit.war for actual running of gerrit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/384760 (owner: 10Chad) [19:00:51] no_justification i found a mistake in ^^ heh [19:01:04] i was wondering why gerrit.war was not showing as a symlink in bin/ [19:01:27] !log ran LoginNotify/maintenance/migratePreferences.php on test and test2 [19:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:16] paladox: Tbh, we could just run it from the one outside of review_site and ignore bin/ entirely :) [19:02:25] oh can we? 
[19:02:29] ah i see [19:02:36] It doesn't matter, since we provide full paths to -d [19:02:36] that will need a systemd change [19:02:42] Either way, it's one change [19:02:45] yeh [19:03:12] (03CR) 10Paladox: [C: 031] Also use scap-deployed version of gerrit.war for actual running of gerrit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/384760 (owner: 10Chad) [19:03:15] This is probably safer and doesn't need a restart [19:03:18] So amending [19:03:22] yep [19:03:28] (03PS2) 10Chad: Also use scap-deployed version of gerrit.war for actual running of gerrit [puppet] - 10https://gerrit.wikimedia.org/r/384760 [19:04:25] (03CR) 10Paladox: [C: 031] "thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/384760 (owner: 10Chad) [19:04:59] (03PS1) 10BBlack: Merge branch 'master' into debian [software/varnish/vhtcpd] (debian) - 10https://gerrit.wikimedia.org/r/385018 [19:05:01] (03PS1) 10BBlack: vhtcpd (0.1.0-1) unstable; urgency=low [software/varnish/vhtcpd] (debian) - 10https://gerrit.wikimedia.org/r/385019 [19:08:52] (03CR) 10Smalyshev: wdqs: LVS check should reach blazegraph and do a simple query (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/384938 (owner: 10Gehel) [19:12:47] i'm guessing tomorrow we can dismantle the gerrit deb repo :). If things are still ok. [19:13:58] (03CR) 10Ladsgroup: [C: 031] "That's a great idea." [puppet] - 10https://gerrit.wikimedia.org/r/384951 (owner: 10Hoo man) [19:15:21] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3694836 (10chasemp) @Andrew scheduled 20 instances to this server and 4 think they came up and the rest failed. ```2017-10-18 18:13:49.714 2530 ERROR nova.compute.manager...
[19:15:43] !log stopping nodepool temporarily to give rabbit a chance to catch up [19:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:30] !log restarting nodepool [19:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:45] (03CR) 10BBlack: [V: 032 C: 032] Merge branch 'master' into debian [software/varnish/vhtcpd] (debian) - 10https://gerrit.wikimedia.org/r/385018 (owner: 10BBlack) [19:18:52] (03CR) 10BBlack: [V: 032 C: 032] vhtcpd (0.1.0-1) unstable; urgency=low [software/varnish/vhtcpd] (debian) - 10https://gerrit.wikimedia.org/r/385019 (owner: 10BBlack) [19:19:55] !log uploaded vhtcpd-0.1.0-1 to jessie-wikimedia, testing on cp1008 only for now [19:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:21] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:24:21] PROBLEM - puppet last run on mw1212 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:24:32] PROBLEM - puppet last run on ganeti1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:24:51] PROBLEM - puppet last run on elastic1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:25:02] PROBLEM - puppet last run on ununpentium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:25:11] PROBLEM - puppet last run on elastic1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:25:32] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:25:32] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:25:41] PROBLEM - puppet last run on mw1220 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:25:41] PROBLEM - puppet last run on restbase1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:25:51] PROBLEM - puppet last run on db1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:25:52] PROBLEM - puppet last run on elastic1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:25:52] PROBLEM - puppet last run on cp1052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:26:02] PROBLEM - puppet last run on mwlog1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:26:12] PROBLEM - puppet last run on mc1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:26:23] ^did someone merge something invasive recently? [19:26:32] PROBLEM - puppet last run on releases1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:26:42] PROBLEM - puppet last run on mc1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:26:42] PROBLEM - puppet last run on mc1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:26:42] PROBLEM - puppet last run on db1062 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:26:42] PROBLEM - puppet last run on mw1273 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:26:52] PROBLEM - puppet last run on phab1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:27:01] PROBLEM - puppet last run on thumbor1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:27:02] PROBLEM - puppet last run on mw1284 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:27:11] PROBLEM - puppet last run on mw1204 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:27:11] PROBLEM - puppet last run on mw1258 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:27:12] PROBLEM - puppet last run on ms-be1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:27:12] PROBLEM - puppet last run on planet1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:27:20] chasemp: picked random one, elastic1036, no issue there. so looks like master [19:27:20] chasemp i think it's puppetdb. [19:27:22] ...seems to be transient where I checked [19:27:35] mutante: same experience here [19:27:41] PROBLEM - puppet last run on dbproxy1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:28:02] PROBLEM - puppet last run on mw1193 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:28:02] PROBLEM - puppet last run on logstash1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:28:11] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:28:12] PROBLEM - puppet last run on conf1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:28:22] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:29:01] PROBLEM - puppet last run on labstore1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:29:01] PROBLEM - puppet last run on db1069 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:29:11] PROBLEM - puppet last run on labnodepool1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:29:13] (03CR) 10Zoranzoki21: ":D" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383999 (owner: 10Hoo man) [19:29:21] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:29:21] PROBLEM - puppet last run on prometheus1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:29:21] PROBLEM - puppet last run on ms-be1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:29:31] PROBLEM - puppet last run on mw1224 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:29:41] PROBLEM - puppet last run on rdb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:30:01] PROBLEM - puppet last run on restbase-dev1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:30:01] PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:30:02] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:30:02] PROBLEM - puppet last run on analytics1045 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:30:11] PROBLEM - puppet last run on mw1195 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:30:32] PROBLEM - puppet last run on dbproxy1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:30:41] PROBLEM - puppet last run on ms-fe1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:30:41] PROBLEM - puppet last run on naos is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:30:51] RECOVERY - puppet last run on elastic1036 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:31:25] puppetdb is running.. but i think paladox is right that it was a short outage there [19:31:28] and it's recovering now [19:31:31] RECOVERY - puppet last run on releases1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:31:39] I saw a lot of transient puppetmaster1002 puppet-master[8600]: The environment must be purely alphanumeric, not '' [19:31:49] but I'm unsure if that's normal noise or not as there is a lot of noise [19:32:04] (03PS1) 10RobH: lvs400[567] production dns entries [dns] - 10https://gerrit.wikimedia.org/r/385022 (https://phabricator.wikimedia.org/T178436) [19:32:20] chasemp: that looks like normal noise to me [19:32:21] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:32:27] seen that a lot without this issue [19:35:22] (03CR) 10RobH: [C: 032] lvs400[567] production dns entries [dns] - 10https://gerrit.wikimedia.org/r/385022 (https://phabricator.wikimedia.org/T178436) (owner: 10RobH) [19:35:41] k, I see almost all successes at the moment but am not sure what was the deal [19:35:49] gotta hop in a meeting in a minute tho [19:36:12] hmm I was just running some catalog diffs. wonder if that is related? [19:36:24] if it bogs down the db likely... [19:36:26] could have increased the load on the masters [19:36:33] puppetdb that is [19:36:42] but it was one at a time with a sleep in between each [19:41:05] (03PS1) 10RobH: new ulsfo lvs systems install params [puppet] - 10https://gerrit.wikimedia.org/r/385023 (https://phabricator.wikimedia.org/T178436) [19:43:39] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3694964 (10Andrew) I think the VM creation failure was a (mostly? completely?) unrelated issue. I've rescheduled some actually running VMs there, and will see how they do. 
[19:51:42] RECOVERY - puppet last run on mc1020 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [19:51:52] RECOVERY - puppet last run on phab1001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [19:52:02] RECOVERY - puppet last run on mw1284 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [19:52:12] RECOVERY - puppet last run on ms-be1032 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [19:52:12] RECOVERY - puppet last run on planet1001 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [19:53:02] RECOVERY - puppet last run on mw1193 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [19:53:02] RECOVERY - puppet last run on logstash1002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [19:53:11] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [19:53:21] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:54:01] RECOVERY - puppet last run on db1069 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [19:54:11] RECOVERY - puppet last run on labnodepool1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:54:21] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [19:54:21] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:54:21] RECOVERY - puppet last run on prometheus1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:54:21] RECOVERY - puppet last run on ms-be1020 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [19:54:21] RECOVERY - puppet last run 
on mw1212 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:54:31] RECOVERY - puppet last run on mw1224 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:54:32] RECOVERY - puppet last run on ganeti1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:54:41] RECOVERY - puppet last run on rdb1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:54:51] RECOVERY - puppet last run on elastic1047 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:54:52] RECOVERY - puppet last run on restbase-dev1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:55:01] RECOVERY - puppet last run on eeden is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:55:02] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:55:02] RECOVERY - puppet last run on analytics1045 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [19:55:02] RECOVERY - puppet last run on ununpentium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:55:11] RECOVERY - puppet last run on mw1195 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:55:11] RECOVERY - puppet last run on elastic1023 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:55:32] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:55:32] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:55:32] RECOVERY - puppet last run on dbproxy1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:55:42] RECOVERY - puppet last run on mw1220 is OK: OK: Puppet is currently enabled, 
last run 3 minutes ago with 0 failures [19:55:42] RECOVERY - puppet last run on ms-fe1007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:55:42] RECOVERY - puppet last run on naos is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:55:42] RECOVERY - puppet last run on restbase1012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:55:51] RECOVERY - puppet last run on db1020 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:55:52] RECOVERY - puppet last run on cp1052 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:56:02] RECOVERY - puppet last run on mwlog1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:56:08] (03PS1) 10Ottomata: Temporarily allow rsync access to FRACK hosts. [puppet] - 10https://gerrit.wikimedia.org/r/385027 (https://phabricator.wikimedia.org/T178509) [19:56:12] RECOVERY - puppet last run on mc1028 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:56:37] Jeff_Green: ^^ [19:56:42] RECOVERY - puppet last run on mc1034 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:56:42] RECOVERY - puppet last run on db1062 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:56:42] RECOVERY - puppet last run on mw1273 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:57:01] RECOVERY - puppet last run on thumbor1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:57:11] RECOVERY - puppet last run on mw1204 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:57:11] RECOVERY - puppet last run on mw1258 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:57:13] ottomata: great. 
I'm doing terrible queries now :-) [19:57:21] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:57:41] RECOVERY - puppet last run on dbproxy1009 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:57:41] don't forget to limit to the partitions you need :D [19:58:11] RECOVERY - puppet last run on conf1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:58:43] limit to partitions? I've got the date limits and the hostname and path stuff, I'm not sure what you mean beyond that? [19:59:01] RECOVERY - puppet last run on labstore1007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:59:53] (03CR) 10Ottomata: [C: 032] Temporarily allow rsync access to FRACK hosts. [puppet] - 10https://gerrit.wikimedia.org/r/385027 (https://phabricator.wikimedia.org/T178509) (owner: 10Ottomata) [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / …. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171018T2000). [20:00:04] No GERRIT patches in the queue for this window AFAICS. [20:00:08] date limits [20:00:11] are mostly [20:00:13] you can also limit to text data [20:00:22] jouncebot: Nothing for ORES today [20:00:36] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Queries#Always_restrict_queries_to_a_date_range_.28partitioning.29 [20:00:45] jgreen so ya, the dates you need [20:00:45] and [20:00:49] webrequest_source='text' [20:00:55] that way you don't also have to read upload, etc. [20:01:41] oh ho [20:02:32] i'll do that next time, it's too late for the current query which is currently writing files [20:02:51] is there a way to consolidate output into a single file instead of the many in dir thing?
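(Editorial aside: the partition-pruning advice in the exchange above — constrain webrequest queries by the date partitions and `webrequest_source='text'` so Hive reads only the partitions it needs — amounts to a WHERE clause like the one this small helper builds. The function and its name are purely illustrative, not an actual tool from the discussion; the `year`/`month`/`day`/`webrequest_source` partition columns are assumed from the linked wikitech page.)

```python
from datetime import date, timedelta

def partition_predicate(source: str, start: date, end: date) -> str:
    """Build a Hive WHERE fragment restricting a webrequest query to one
    webrequest_source and a range of daily partitions, so Hive reads only
    the partitions it needs instead of scanning the whole table."""
    days = []
    d = start
    while d <= end:
        days.append(f"(year={d.year} AND month={d.month} AND day={d.day})")
        d += timedelta(days=1)
    return f"webrequest_source='{source}' AND ({' OR '.join(days)})"
```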
[20:04:59] 10Operations, 10Beta-Cluster-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): nutcracker fails to start due to lack of /var/run/nutcracker (ex: deployment-videoscaler01 has memcached failures) - https://phabricator.wikimedia.org/T178457#3694999 (10hashar) @gilles sorry for the spam... [20:05:05] (03CR) 10Hashar: Create /run/nutcracker on stretch onwards (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/384980 (https://phabricator.wikimedia.org/T178457) (owner: 10Muehlenhoff) [20:05:35] nothing for mobileapps today [20:08:24] Jeff_Green: from hive not really [20:08:33] ok [20:08:34] since the files are being written by individual processes on various nodes across the cluster [20:08:42] but you can do it when you copy to your home dir out of hdfs [20:08:55] yup yup, easy enough [20:08:58] hdfs dfs -text /path/to/files/in/hdfs/* > localfile [20:09:13] or -copyToLocal and then cat * > localfile [20:09:34] ok [20:14:29] ottomata: I've been using "overwrite local directory" and so on, is that redundant with what's created in /mnt/hdfs/.../jgreen/ ? [20:14:36] (03PS2) 10RobH: new ulsfo lvs systems install params [puppet] - 10https://gerrit.wikimedia.org/r/385023 (https://phabricator.wikimedia.org/T178436) [20:15:05] (03CR) 10RobH: [C: 032] new ulsfo lvs systems install params [puppet] - 10https://gerrit.wikimedia.org/r/385023 (https://phabricator.wikimedia.org/T178436) (owner: 10RobH) [20:16:28] in hive you can set some configs to cause hive to collapse files together. Only really useful if it's a script that runs regularly or something though.
See lines 29-37: https://gerrit.wikimedia.org/r/#/c/317019/12/oozie/query_clicks/daily/query_clicks_daily.hql [20:16:32] !log Changing MTUs on interfaces to NTT [20:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:21] oh ebernhardson [20:17:22] -- Enable result file merging [20:17:22] set hive.merge.mapfiles=true; [20:17:22] set hive.merge.mapredfiles=true; [20:17:24] didn't know about that! [20:17:25] really? [20:17:28] that's cool [20:17:32] joal: ^^^ did you know about that? [20:17:55] Jeff_Green: not sure i understand your question [20:17:56] but just FYI [20:18:15] /mnt/hdfs is just a read-only mount to the HDFS file system, so that you can cd, ls, etc. into it [20:18:34] for your case it's probably fine to copy files directly from it, since I think your data is relatively small(?), but if it isn't [20:18:40] it'd be better to get it using the hdfs dfs CLI [20:18:51] I've been working from the example you gave me earlier, https://gerrit.wikimedia.org/r/#/c/327003/1/oozie/webrequest/legacy_tsvs/generate_sampled-1000_tsv.hql -- in that query there's a local output directory specified [20:18:56] hdfs dfs -ls, hdfs dfs -get, hdfs dfs -copyToLocal, -text, etc.
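(Editorial aside: the merge step discussed above — pulling Hive's many part files together into a single local TSV, which `hdfs dfs -text /path/to/files/in/hdfs/* > localfile` does when the output is in HDFS — can be sketched as below for output written to a local directory. The function name and paths are hypothetical; the `.crc` filter matches the checksum files mentioned later in the conversation.)

```python
import glob
import os

def merge_part_files(output_dir: str, dest_path: str) -> int:
    """Concatenate Hive's part output files from output_dir into one local
    file at dest_path, skipping the .crc checksum files written alongside
    them. Returns the number of part files merged."""
    parts = sorted(p for p in glob.glob(os.path.join(output_dir, "*"))
                   if not p.endswith(".crc") and os.path.isfile(p))
    with open(dest_path, "wb") as dest:
        for part in parts:
            with open(part, "rb") as src:
                dest.write(src.read())
    return len(parts)
```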
[20:19:17] 10Operations, 10ops-ulsfo, 10Traffic: rack/setup/install lvs400[567].ulsfo.wmnet - https://phabricator.wikimedia.org/T178436#3695031 (10RobH) [20:19:30] ya, whatever you put for directory there is an HDFS directory [20:19:52] the landingpages stuff is a few GB for the ~8 days in question, the /beacon/impressions stuff (which we normally sample 1:10) is much larger I think [20:20:04] so yes, should be in HDFS somewhere, (and available in the /mnt/hdfs mount) [20:20:32] k, might be faster / safer (/mnt/hdfs *usually* works, but not always) to use hdfs CLI [20:20:34] hdfs dfs [20:21:26] ottomata: yea it seems to work, result files usually end up somewhere between 128M and 256M [20:21:32] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 42 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [20:21:52] ottomata: hdfs cli as opposed to using hive at all? [20:23:11] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 115 probes of 293 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [20:23:48] Jeff_Green: no no, to copy the files that Hive creates out of HDFS [20:24:06] won't those files be hard to parse, as they will be in some hadoop format like parquet?
[20:24:19] Jeff_Green: is generating TSVs using Hive :p [20:24:22] oh [20:24:28] then it might work :) [20:24:29] https://gerrit.wikimedia.org/r/#/c/327003/1/oozie/webrequest/legacy_tsvs/generate_sampled-1000_tsv.hql [20:24:32] something like that ^ [20:24:41] Jeff_Green: once Hive is done [20:24:46] you'll have multiple files in HDFS [20:25:01] to get single files, probably easiest to just cat them together using hdfs cli into a local file [20:25:03] something like [20:25:06] right, they look like tsvs in the output format specified and there's [20:25:26] hdfs dfs -text /path/to/hive/output/in/hdfs/* > localfile.tsv [20:25:29] there are also .crc files [20:25:34] oh ya [20:25:38] wait [20:25:41] in hdfs? [20:25:58] I don't know :-) this is what I'm trying to understand [20:26:32] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 7 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [20:27:11] that example you posted above writes a whole lot of output files in whatever dir you specify, for the landingpages query there are 31K output files, half of which are *.crc [20:29:11] each one of the non-crc files contains a batch of logs in the TSV output format specified in the CONCAT() in that query [20:33:11] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 1 probes of 293 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [20:33:24] 10Operations, 10Ops-Access-Requests, 10Research, 10Patch-For-Review: Request public key change for a research fellow - https://phabricator.wikimedia.org/T177889#3695061 (10RobH) @Cervisiarius: I'm also on clinic duty this week, so you should feel free to ping for assistance via IRC. Sometimes it is easier...
[20:33:57] ottomata: you can see the queries I'm running in stat1005:/home/jgreen/query_notes.txt and the output of the first is on disk already in /home/jgreen/landing [20:34:12] that said I gotta change venues, I'll be back in ~15 [20:42:16] OH [20:42:19] LOCAL DIRECTORY [20:42:20] huh. [20:42:21] cool [20:42:31] then you don't need hdfs dfs CLI [20:42:36] ok, understand why you have .crcs now [20:42:37] cool [20:42:45] you don't really need those, you can probably just rm them when it's done [20:43:03] UHHH [20:43:30] AHH Jeff_Green come back! [20:43:33] you are doing a crazy thing [20:55:46] ottomata: still around? [20:56:26] 10Operations, 10ops-ulsfo, 10Traffic: rack/setup/install lvs400[567].ulsfo.wmnet - https://phabricator.wikimedia.org/T178436#3695131 (10RobH) [21:03:33] 10Operations, 10Traffic: rack/setup/install lvs400[567].ulsfo.wmnet - https://phabricator.wikimedia.org/T178436#3695158 (10RobH) a:05RobH>03None [21:04:01] 10Operations, 10Traffic: rack/setup/install lvs400[567].ulsfo.wmnet - https://phabricator.wikimedia.org/T178436#3692125 (10RobH) This is now ready for someone in #traffic to take over for service implementation/replacement of the existing lvs400[1-4]. [21:04:03] Jeff_Green: [21:04:03] ya [21:04:06] you are doing some crazy stuff! [21:04:07] haha [21:04:16] you still have TABLESAMPLE(BUCKET 1 OUT OF 10 ON rand()) [21:04:19] you want that? [21:04:20] :-$ [21:04:56] yeah, we sample 1:10 for /beacon/impressions specifically, i think it was a preposterous amount of data otherwise and we don't really need that level of precision [21:05:02] ohhh ok [21:05:06] sorry maybe not so crazy then [21:05:07] we do that now at the kafkatee pipe [21:05:11] ahhh [21:05:11] ok cool [21:05:23] ok so i see what you are doing then, this looks fine.
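(Editorial aside: the TABLESAMPLE(BUCKET 1 OUT OF 10 ON rand()) clause discussed above keeps roughly one row in ten at random. A minimal Python analogue of that kind of 1:10 row sampling — the function name is made up for illustration, and Hive's bucket sampling on rand() is only approximately this per-row coin flip:)

```python
import random

def sample_one_in_ten(lines, seed=None):
    """Keep each line with probability 1/10, mimicking the effect of
    Hive's TABLESAMPLE(BUCKET 1 OUT OF 10 ON rand()) row sampling."""
    rng = random.Random(seed)
    return [line for line in lines if rng.randrange(10) == 0]
```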
LOCAL DIRECTORY makes sense [21:05:24] also when I ran it without that the output was >150GB :-) [21:05:31] and i see why you have .crcs then [21:05:34] you won't need those [21:05:42] you can delete them after hive is done [21:05:50] but ya, you've written this to local FS, not HDFS [21:05:53] so you don't need hdfs cli [21:06:01] you can do whatever you want to cat the files together [21:07:58] I'm a little puzzled what the output is, in the local directory. Here https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Writingdataintothefilesystemfromqueries they talk about statements "to HDFS filesystem" being the most efficient, I'm pretty sure that's ~not~ what's happening specifying a directory in my homedir, but not positive [21:08:03] (03PS3) 10Dzahn: Also use scap-deployed version of gerrit.war for actual running of gerrit [puppet] - 10https://gerrit.wikimedia.org/r/384760 (owner: 10Chad) [21:09:32] 10Operations, 10ops-ulsfo, 10Traffic: decommission lvs400[1-4].ulsfo.wmnet - https://phabricator.wikimedia.org/T178535#3695169 (10RobH) [21:09:36] 10Operations, 10Traffic: rack/setup/install lvs400[567].ulsfo.wmnet - https://phabricator.wikimedia.org/T178436#3692125 (10RobH) [21:09:38] 10Operations, 10ops-ulsfo, 10Traffic: decommission lvs400[1-4].ulsfo.wmnet - https://phabricator.wikimedia.org/T178535#3695169 (10RobH) [21:10:38] Jeff_Green: that's right [21:10:41] by doing LOCAL DIRECTORY [21:10:43] you are not using HDFS [21:10:48] ok [21:11:05] so that means that every process across the cluster that would have written data to the local disk where they run [21:11:11] instead is beaming the data to your local client [21:11:12] (03CR) 10Dzahn: [C: 032] Also use scap-deployed version of gerrit.war for actual running of gerrit [puppet] - 10https://gerrit.wikimedia.org/r/384760 (owner: 10Chad) [21:11:18] and writing to your local disk [21:11:25] ok [21:11:35] you would need to pull it down to your homedir eventually anyway [21:11:36]
so this is fine [21:12:04] when that's done I'll just rsync them to americum and parse/sort/split into 15 minute increment logfiles to clobber the bad ones [21:12:28] meanwhile kafkatee is still running clean since 10/4 [21:12:59] and casey is working on packaging & testing librdkafka 0.9.6 and the latest kafkatee version [21:13:00] great [21:13:14] Jeff_Green: you have to split into 15 min logfiles? why? [21:13:15] just curious [21:14:29] we did that historically to keep the reporting lag low, I can probably backfill in larger chunks but it's easy enough to automate [21:14:50] (03PS1) 10Odder: Add high-density logos for the English Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385094 (https://phabricator.wikimedia.org/T177506) [21:15:28] I'm not sure yet what they have to do at the parser to rerun that timeframe... [21:16:17] did you see I was finally able to pinpoint the recovery time? [21:20:31] Jeff_Green: yeah but it didn't correlate with anything, right? [21:22:05] not that I've found, no [21:22:43] also the fact that we saw loss after that window with kafkacat on alnitak doesn't bode well for an isolated event [21:23:21] woot. rsync is working [21:24:42] no_justification hi, with your patch, it will use the scap deployed version which is 2.13.9 compared to the one we are currently running. [21:24:48] do we need to restart gerrit [21:24:49] ? [21:25:16] Eh, we should upgrade anyway [21:25:22] ok thanks [21:25:47] mutante ^^ [21:30:03] no_justification: so you want to do the upgrade right now?
i can submit it [21:30:12] i am also on a bus, but it's fine :) [21:30:17] I'll do the upgrade first, uno momento [21:30:21] ok [21:32:58] PROBLEM - gerrit process on cobalt is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xmx20g -jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site [21:32:58] PROBLEM - Check systemd state on cobalt is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:33:52] Gerrit's back [21:33:58] RECOVERY - gerrit process on cobalt is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xmx20g -jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site [21:33:58] RECOVERY - Check systemd state on cobalt is OK: OK - running: The system is fully operational [21:34:11] https://gerrit.wikimedia.org/r/ [21:34:12] 503 [21:34:14] * mutante submits https://gerrit.wikimedia.org/r/#/c/384760/ [21:34:24] permission errors? [21:34:25] Press f5 ;-) [21:34:33] oh [21:34:34] works [21:34:46] thanks :) [21:34:53] no_justification: your change is on puppetmaster now.. but i'm not on cobalt itself [21:35:23] well. to be correct.. now it is really done [21:35:48] PROBLEM - puppet last run on db1095 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [21:36:41] Notice: /Stage[main]/Gerrit::Jetty/File[/var/lib/gerrit2/review_site/bin/gerrit.war]/ensure: ensure changed 'file' to 'link' [21:36:42] Wheee [21:37:01] yea, so it didn't have an issue linking over the existing file [21:37:06] :) [21:37:07] but after an "unlink" the original file would be gone [21:37:15] windows users will like the fix for inline edit :). [21:37:24] paladox: hehe [21:37:27] :) [21:37:39] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 6 failures.
Last run 4 minutes ago with 6 failures. Failed resources (up to 3 shown): Exec[git_pull_analytics/reportupdater],Exec[git_pull_wmde/scripts],Exec[git_pull_wmde/toolkit-analyzer-build],Exec[git_pull_mediawiki/event-schemas] [21:38:18] nice, now you can actually remove the old package setup [21:38:20] right [21:38:28] yeh [21:38:44] Yeah. Basically that's simple: remove the package from the debian repo, archive (actually, probably just delete) the git repos [21:39:02] Oh, remove from puppet manifest first I s'pose :) [21:39:13] heh :) [21:39:16] :) [21:39:17] * paladox submits change [21:39:25] my bus is arriving, be back soon [21:39:55] (03PS1) 10Chad: group1 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385099 [21:39:57] (03CR) 10Chad: [C: 032] group1 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385099 (owner: 10Chad) [21:40:09] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [21:42:09] (03Draft1) 10Paladox: Gerrit: Remove gerrit package from apt [puppet] - 10https://gerrit.wikimedia.org/r/385100 (https://phabricator.wikimedia.org/T157414) [21:42:14] (03PS2) 10Paladox: Gerrit: Remove gerrit package from apt [puppet] - 10https://gerrit.wikimedia.org/r/385100 (https://phabricator.wikimedia.org/T157414) [21:42:19] no_justification mutante ^^ :) [21:42:19] (03Merged) 10jenkins-bot: group1 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385099 (owner: 10Chad) [21:42:21] ok [21:42:35] (03CR) 10jenkins-bot: group1 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385099 (owner: 10Chad) [21:42:44] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Remove gerrit package from apt [puppet] - 10https://gerrit.wikimedia.org/r/385100 (https://phabricator.wikimedia.org/T157414) (owner: 10Paladox) [21:43:28] (03PS3) 10Paladox: Gerrit: Remove gerrit package from apt [puppet] - 
10https://gerrit.wikimedia.org/r/385100 (https://phabricator.wikimedia.org/T157414) [21:44:09] (03Draft1) 10Paladox: Gerrit: Add missing group to /var/lib/gerrit2 [puppet] - 10https://gerrit.wikimedia.org/r/385101 [21:44:12] (03PS2) 10Paladox: Gerrit: Add missing group to /var/lib/gerrit2 [puppet] - 10https://gerrit.wikimedia.org/r/385101 [21:44:21] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Remove gerrit package from apt [puppet] - 10https://gerrit.wikimedia.org/r/385100 (https://phabricator.wikimedia.org/T157414) (owner: 10Paladox) [21:44:45] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Add missing group to /var/lib/gerrit2 [puppet] - 10https://gerrit.wikimedia.org/r/385101 (owner: 10Paladox) [21:44:55] sorry for spam. [21:45:00] (03PS4) 10Paladox: Gerrit: Remove gerrit package from apt [puppet] - 10https://gerrit.wikimedia.org/r/385100 (https://phabricator.wikimedia.org/T157414) [21:45:09] (03PS3) 10Paladox: Gerrit: Add missing group to /var/lib/gerrit2 [puppet] - 10https://gerrit.wikimedia.org/r/385101 [21:45:40] * paladox cherry picks [21:46:45] works [21:46:51] no puppet errors [21:47:25] (03CR) 10Chad: Gerrit: Remove gerrit package from apt (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/385100 (https://phabricator.wikimedia.org/T157414) (owner: 10Paladox) [21:48:11] (03CR) 10Chad: [C: 031] Gerrit: Add missing group to /var/lib/gerrit2 [puppet] - 10https://gerrit.wikimedia.org/r/385101 (owner: 10Paladox) [21:48:31] (03PS5) 10Paladox: Gerrit: Remove gerrit package from apt [puppet] - 10https://gerrit.wikimedia.org/r/385100 (https://phabricator.wikimedia.org/T157414) [21:48:42] (03CR) 10Paladox: Gerrit: Remove gerrit package from apt (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/385100 (https://phabricator.wikimedia.org/T157414) (owner: 10Paladox) [21:49:00] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Remove gerrit package from apt [puppet] - 10https://gerrit.wikimedia.org/r/385100 (https://phabricator.wikimedia.org/T157414) (owner: 10Paladox) 
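The puppet notice earlier in the log ("ensure changed 'file' to 'link'" for gerrit.war), and the follow-up point that after an "unlink" the original file would be gone, come down to ordinary symlink-replacement semantics. A stand-in reproduction (paths here are placeholders, not the real cobalt paths):

```shell
set -e
tmp=$(mktemp -d)

# Stand-ins for the deb-installed war and the scap-deployed war.
echo "old packaged gerrit.war" > "$tmp/gerrit.war"
echo "scap-deployed gerrit.war" > "$tmp/scap-gerrit.war"

# ln -sf replaces the existing regular file with a symlink, which is what
# puppet's ensure 'file' -> 'link' change did. The old file's contents are
# gone after this, so removing the link later won't bring them back.
ln -sf "$tmp/scap-gerrit.war" "$tmp/gerrit.war"

readlink "$tmp/gerrit.war"
cat "$tmp/gerrit.war"   # reads through the link to the scap-deployed copy
```

This is why the channel notes it "didn't have an issue linking over the existing file": the replacement is atomic from puppet's point of view, but destructive for the original file.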
[21:49:37] (03PS6) 10Paladox: Gerrit: Remove gerrit package from apt [puppet] - 10https://gerrit.wikimedia.org/r/385100 (https://phabricator.wikimedia.org/T157414) [21:58:26] (03PS1) 10ArielGlenn: recompress bz2 files in batches not restricted to subjobs [dumps] - 10https://gerrit.wikimedia.org/r/385104 [21:58:52] (03PS3) 10Paladox: Gerrit: Switch to the mariadb connector [puppet] - 10https://gerrit.wikimedia.org/r/384588 (https://phabricator.wikimedia.org/T176164) [21:59:00] (03PS4) 10Paladox: Gerrit: Switch to the mariadb connector [puppet] - 10https://gerrit.wikimedia.org/r/384588 (https://phabricator.wikimedia.org/T176164) [22:01:13] (03Draft1) 10Paladox: Gerrit: remove libbcprov-java and libbcpkix-java packages [puppet] - 10https://gerrit.wikimedia.org/r/385105 [22:01:16] (03PS2) 10Paladox: Gerrit: remove libbcprov-java and libbcpkix-java packages [puppet] - 10https://gerrit.wikimedia.org/r/385105 [22:05:08] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:05:47] RECOVERY - puppet last run on db1095 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [22:06:21] paladox: back, what's up with "missing group" [22:06:28] looks [22:06:32] nothing, just inconsistent :) [22:06:42] all the other code has group => 'gerrit2' [22:06:50] ok [22:06:59] :). [22:07:38] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [22:08:01] paladox: The change could not be rebased due to a conflict during merge. [22:08:10] ah the group one?
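The "missing group" change being discussed (https://gerrit.wikimedia.org/r/385101) is small enough to sketch. The log only shows that the other resources in the module carry group => 'gerrit2'; everything else in this resource (ensure, mode) is an assumption, not the actual operations/puppet manifest:

```puppet
# Sketch only: give /var/lib/gerrit2 an explicit group so it matches the
# rest of the gerrit module, which already uses group => 'gerrit2'.
file { '/var/lib/gerrit2':
  ensure => directory,
  owner  => 'gerrit2',
  group  => 'gerrit2',
  mode   => '0755',
}
```

As Dzahn notes, the change is a no-op on cobalt where the directory already has that group; the resource just pins the existing state.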
[22:08:13] yep [22:08:14] that one is dependent [22:08:20] on https://gerrit.wikimedia.org/r/385100 [22:08:59] technically yea, but code-wise, no [22:09:13] but not important [22:09:22] it can go after the other one then [22:10:12] (03PS1) 10MaxSem: Remove old email blocking hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385106 (https://phabricator.wikimedia.org/T175419) [22:10:45] paladox: did you try 385100 yet? i mean the actual puppet run with the new "require" lines [22:10:51] yes [22:10:54] :) [22:10:58] and you actually have scap on it too [22:11:23] yep [22:11:28] (03PS7) 10Paladox: Gerrit: Remove gerrit package from apt [puppet] - 10https://gerrit.wikimedia.org/r/385100 (https://phabricator.wikimedia.org/T157414) [22:11:30] (03PS4) 10Paladox: Gerrit: Add missing group to /var/lib/gerrit2 [puppet] - 10https://gerrit.wikimedia.org/r/385101 [22:12:02] rebased. [22:14:17] (03CR) 10Dzahn: [C: 032] "yea, it's like that on cobalt" [puppet] - 10https://gerrit.wikimedia.org/r/385101 (owner: 10Paladox) [22:14:34] "submit incl.
parents" [22:14:52] :) [22:14:55] but dont worry too much, we can do both [22:14:58] just compiling it [22:15:04] (03CR) 10Gergő Tisza: [C: 032] Deploy ReadingLists to the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384908 (https://phabricator.wikimedia.org/T174651) (owner: 10Gergő Tisza) [22:15:11] ok thanks :) [22:16:11] (03Merged) 10jenkins-bot: Deploy ReadingLists to the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384908 (https://phabricator.wikimedia.org/T174651) (owner: 10Gergő Tisza) [22:16:35] (03CR) 10Zoranzoki21: [C: 031] Remove old email blocking hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385106 (https://phabricator.wikimedia.org/T175419) (owner: 10MaxSem) [22:16:37] (03CR) 10jenkins-bot: Deploy ReadingLists to the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384908 (https://phabricator.wikimedia.org/T174651) (owner: 10Gergő Tisza) [22:17:54] (03PS8) 10Dzahn: Gerrit: stop installing deb package, replace with scap requires [puppet] - 10https://gerrit.wikimedia.org/r/385100 (https://phabricator.wikimedia.org/T157414) (owner: 10Paladox) [22:18:14] i am just renaming it slightly because "removing it from APT" wasnt really what this does [22:18:19] thanks [22:18:23] that would be like running reprepro commands on apt.wm [22:19:05] heh [22:19:51] 10Operations, 10MediaWiki-General-or-Unknown, 10RfC: Bump PHP requirement to 5.6 in 1.31 - https://phabricator.wikimedia.org/T178538#3695340 (10MaxSem) [22:22:20] 10Operations, 10MediaWiki-General-or-Unknown, 10RfC: Bump PHP requirement to 5.6 in 1.31 - https://phabricator.wikimedia.org/T178538#3695361 (10Jdforrester-WMF) [22:22:34] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/8369/" [puppet] - 10https://gerrit.wikimedia.org/r/385100 (https://phabricator.wikimedia.org/T157414) (owner: 10Paladox) [22:22:38] (03CR) 10Dzahn: [C: 032] Gerrit: stop installing deb package, replace with scap requires [puppet] - 
10https://gerrit.wikimedia.org/r/385100 (https://phabricator.wikimedia.org/T157414) (owner: 10Paladox) [22:22:43] thanks :) [22:24:29] on gerrit2001 - Service[gerrit]/ensure: ensure changed 'stopped' to 'running' [22:24:59] 10Operations, 10MediaWiki-General-or-Unknown, 10TechCom-RfC: Bump PHP requirement to 5.6 in 1.31 - https://phabricator.wikimedia.org/T178538#3695375 (10Legoktm) [22:25:02] heh [22:25:08] i presume because of the db thing [22:25:12] 10Operations, 10MediaWiki-General-or-Unknown, 10TechCom-RfC: Bump PHP requirement to 5.6 in 1.31 - https://phabricator.wikimedia.org/T178538#3695377 (10Jdforrester-WMF) [22:25:13] it probably timed out [22:26:06] on cobalt: no-op [22:26:06] done [22:26:14] :) [22:26:29] (03PS5) 10Dzahn: Gerrit: Add missing group to /var/lib/gerrit2 [puppet] - 10https://gerrit.wikimedia.org/r/385101 (owner: 10Paladox) [22:26:31] package now removed from puppet code no_justification :) [22:26:38] repo can be archived now. [22:27:04] 10Operations, 10MediaWiki-General-or-Unknown, 10TechCom-RfC: Bump PHP requirement to 5.6 in 1.31 - https://phabricator.wikimedia.org/T178538#3695340 (10Jdforrester-WMF) Possible tasks blocking running MediaWiki in production in 5.6-compatible environments: * Convert `mwscript` hosts to run on Zend 5.6, not 5... [22:27:15] i also want to move something from an operations/deb repo to a non-deb repo but i want to keep my history... [22:27:37] is now cloning from the repo in "debs" hierarchy without actually building it [22:27:45] git push --mirror [22:27:46] ?
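The `git push --mirror` suggestion for moving a repo out of the "debs" hierarchy while keeping its history can be demonstrated with throwaway local repos (the paths below are stand-ins for the real Gerrit repos):

```shell
set -e
tmp=$(mktemp -d)

# Stand-in for the old operations/debs repo, with some history worth keeping.
git init -q "$tmp/old-debs-repo"
git -C "$tmp/old-debs-repo" -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m "packaging history worth keeping"

# Stand-in for the new, empty destination repo.
git init -q --bare "$tmp/new-repo.git"

# --mirror pushes every ref (branches, tags, notes), so the full history
# moves over; no re-import or history rewrite is needed.
git -C "$tmp/old-debs-repo" push -q --mirror "$tmp/new-repo.git"

git -C "$tmp/new-repo.git" log --all --format=%s
```

One caveat worth hedging: `--mirror` also deletes destination refs that don't exist in the source, so it should only be pointed at a freshly created target repo. The StackOverflow link in the channel covers the harder case of extracting a single file's history rather than the whole repo.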
[22:27:51] !log tgr@tin Synchronized wmf-config/: T174651 / gerrit 384908 - deploy ReadingLists to beta - noop for production (duration: 00m 53s) [22:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:59] T174651: Beta testing of the ReadingLists extension - https://phabricator.wikimedia.org/T174651 [22:28:10] paladox: oh, looking [22:28:15] :) [22:28:31] or there's https://stackoverflow.com/questions/44777043/git-copy-history-of-file-from-one-repository-to-another [22:28:37] 10Operations, 10MediaWiki-General-or-Unknown, 10TechCom-RfC: Bump PHP requirement to 5.6 in 1.31 - https://phabricator.wikimedia.org/T178538#3695387 (10ArielGlenn) I plan to have the dumps hosts upgraded long before this deadline approaches. [22:28:46] would it be nice to keep the gerrit history ? [22:29:13] oh, i don't know if that would be supported [22:31:27] i merged your other change but it wasnt on channel? [22:31:55] hmm, ah i see because you merged it. [22:32:04] if jenkins did it, it would show :) [22:32:10] thanks [22:32:34] paladox: yea, so on gerrit2001, it times out like you said [22:32:39] yep [22:32:40] after a while puppet will start it again [22:32:45] what triggers scap on the beta cluster?
[22:32:48] yeh [22:32:52] i18n data generation, specifically [22:33:05] tgr jenkins i think, though not sure about i18n [22:33:17] paladox: all operations/puppet changes are merged by humans though [22:33:26] nvm, looks like it did run, just was a little behind code updates [22:34:01] yep [22:36:48] (03CR) 10Putnik: [C: 031] Add high-density logos for the English Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385094 (https://phabricator.wikimedia.org/T177506) (owner: 10Odder) [22:37:04] 10Operations, 10MediaWiki-General-or-Unknown, 10TechCom-RfC: Bump PHP requirement to 5.6 in 1.31 - https://phabricator.wikimedia.org/T178538#3695394 (10Jdforrester-WMF) [22:41:25] 10Operations, 10DC-Ops: update firmware on scs consoles - https://phabricator.wikimedia.org/T174475#3695401 (10RobH) [22:41:52] 10Operations, 10DC-Ops: update firmware on scs consoles - https://phabricator.wikimedia.org/T174475#3563233 (10RobH) a:05RobH>03mark I've assigned this to @mark for feedback on the other scs consoles in esams. I'm guessing they are old cruft and no longer used? If so, this can be resolved. [22:42:25] (03PS2) 10ArielGlenn: recompress bz2 files in batches not restricted to subjobs [dumps] - 10https://gerrit.wikimedia.org/r/385104 [22:45:41] (03Abandoned) 10Paladox: Gerrit: Use the mariadb plugin instead of mysql [puppet] - 10https://gerrit.wikimedia.org/r/336003 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [22:45:59] (03PS12) 10Paladox: Gerrit: Enable logstash by default for prod gerrit [puppet] - 10https://gerrit.wikimedia.org/r/332531 (https://phabricator.wikimedia.org/T141324) [22:55:04] !log tgr@tin Synchronized private/: Remove the llama (T150554) (duration: 00m 50s) [22:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:53] paladox: yay, keep going. 
the older the better, heh [22:59:16] lol [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy Evening SWAT (Max 8 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171018T2300). [23:00:04] Jdlrobson, twkozlowski, and MaxSem: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:33] * MaxSem waves [23:03:55] \o [23:03:56] * no_justification steals deploy conch [23:03:58] For a few [23:04:26] !log demon@tin Synchronized php: symlink swap (duration: 00m 49s) [23:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:29] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 to wmf.4 [23:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:21] no_justification: so are you doing the swat window or is MaxSem ? Little confused with what's happening... [23:25:03] I had a meeting in the beginning [23:25:15] no_justification: are you done? 
[23:27:24] assuming done [23:28:44] (03PS2) 10MaxSem: Use correct name for gadgets popup on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385007 (https://phabricator.wikimedia.org/T178438) (owner: 10Jdlrobson) [23:28:48] (03CR) 10MaxSem: [C: 032] Use correct name for gadgets popup on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385007 (https://phabricator.wikimedia.org/T178438) (owner: 10Jdlrobson) [23:30:00] (03Merged) 10jenkins-bot: Use correct name for gadgets popup on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385007 (https://phabricator.wikimedia.org/T178438) (owner: 10Jdlrobson) [23:30:15] (03CR) 10jenkins-bot: Use correct name for gadgets popup on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385007 (https://phabricator.wikimedia.org/T178438) (owner: 10Jdlrobson) [23:31:32] (03CR) 10Aaron Schulz: Allow specifying --group to sql (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/384951 (owner: 10Hoo man) [23:31:52] jdlrobson: the config change is live on mwdebug1002 [23:31:59] MaxSem: sweet. testing [23:32:19] fixed! yay! [23:32:56] MaxSem: you can sync [23:34:19] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/385007/ (duration: 00m 51s) [23:34:22] jdlrobson: ^ [23:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:01] hurrah. one down [23:36:00] jdlrobson: pulled the Collection change for both branches [23:36:09] MaxSem: hurrah. should be quick to test [23:38:01] next is Odder's change but I don't see him around [23:39:50] anybody familiar with WB logos to check this over? [23:41:08] MaxSem: please ping when collection changes are synced. [23:41:21] jdlrobson: waiting for an ok from you [23:41:48] MaxSem: oh didn't realise. Should I be testing on debug1002?
they look good there :) [23:41:55] as always [23:42:15] MaxSem: thx [23:42:27] ok, deploying [23:43:15] Why IRC pings aren't actually pinging me today.... [23:44:14] !log maxsem@tin Synchronized php-1.31.0-wmf.3/extensions/Collection/: https://gerrit.wikimedia.org/r/#/c/385005/ (duration: 00m 51s) [23:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:30] thanks MaxSem !!! :) [23:45:40] !log maxsem@tin Synchronized php-1.31.0-wmf.4/extensions/Collection/: https://gerrit.wikimedia.org/r/#/c/384903/ (duration: 00m 50s) [23:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:03] (03PS2) 10MaxSem: Add high-density logos for the English Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385094 (https://phabricator.wikimedia.org/T177506) (owner: 10Odder) [23:46:07] (03CR) 10MaxSem: [C: 032] Add high-density logos for the English Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385094 (https://phabricator.wikimedia.org/T177506) (owner: 10Odder) [23:47:06] (03PS8) 10Dzahn: bastionhost: convert to role/profile structure [puppet] - 10https://gerrit.wikimedia.org/r/353599 [23:47:23] (03Merged) 10jenkins-bot: Add high-density logos for the English Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385094 (https://phabricator.wikimedia.org/T177506) (owner: 10Odder) [23:47:33] (03CR) 10jenkins-bot: Add high-density logos for the English Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385094 (https://phabricator.wikimedia.org/T177506) (owner: 10Odder) [23:47:38] (03CR) 10jerkins-bot: [V: 04-1] bastionhost: convert to role/profile structure [puppet] - 10https://gerrit.wikimedia.org/r/353599 (owner: 10Dzahn) [23:49:05] !log maxsem@tin Synchronized static/images/project-logos/: https://gerrit.wikimedia.org/r/#/c/385094/2 (duration: 00m 50s) [23:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:47] !log
maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/385094/2 (duration: 00m 50s) [23:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:11] 10Operations, 10ops-ulsfo: Multiple systems in ulsfo 1.22 showing PSU failures - https://phabricator.wikimedia.org/T177622#3695536 (10RobH) [23:53:13] 10Operations, 10ops-ulsfo, 10Traffic, 10Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3695537 (10RobH) [23:53:54] (03PS2) 10MaxSem: Remove old email blocking hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385106 (https://phabricator.wikimedia.org/T175419) [23:54:00] (03CR) 10MaxSem: [C: 032] Remove old email blocking hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385106 (https://phabricator.wikimedia.org/T175419) (owner: 10MaxSem) [23:55:11] (03Merged) 10jenkins-bot: Remove old email blocking hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385106 (https://phabricator.wikimedia.org/T175419) (owner: 10MaxSem) [23:56:13] (03CR) 10jenkins-bot: Remove old email blocking hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385106 (https://phabricator.wikimedia.org/T175419) (owner: 10MaxSem)