[00:00:04] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160901T0000). Please do the needful.
[00:08:34] PROBLEM - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused
[00:08:53] PROBLEM - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is CRITICAL: Connection refused
[00:10:36] PROBLEM - cassandra-c service on restbase1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[00:10:52] on that ^^^
[00:11:59] (CR) Dzahn: "I have no idea, but somehow i doubt it will cleanup things, i'd expect more that it tries to pull from new repo on top of everything that " [puppet] - https://gerrit.wikimedia.org/r/296687 (https://phabricator.wikimedia.org/T139008) (owner: Ladsgroup)
[00:13:13] RECOVERY - cassandra-c service on restbase1011 is OK: OK - cassandra-c is active
[00:13:22] Operations, Ops-Access-Requests: root access on security-tools instances for Darian Patrick - https://phabricator.wikimedia.org/T138873#2600519 (Dzahn) a:Dzahn>None
[00:13:43] RECOVERY - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is OK: SSL OK - Certificate restbase1011-c valid until 2017-03-01 14:11:15 +0000 (expires in 181 days)
[00:13:50] Operations, Discovery, Discovery-Search, Elasticsearch, Wikimedia-log-errors: Elastica warning about Retrying connection to search.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T144450#2600157 (EBernhardson) I'm able to reproduce this fairly regularly from `nc`, although i need to...
[00:14:02] RECOVERY - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is OK: TCP OK - 0.002 second response time on port 9042
[00:26:54] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 737 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5034673 keys - replication_delay is 737
[00:41:47] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Traffic, and 4 others: CN: Stop using the geoiplookup HTTPS service (always use the Cookie) - https://phabricator.wikimedia.org/T143271#2600579 (AndyRussG) >>! In T143271#2599100, @Mattflaschen-WMF wrote: > If this affects windo...
[00:42:43] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5024883 keys - replication_delay is 25
[00:58:18] (PS1) Gergő Tisza: Do not use $wgExtensionFunctions to set globals [mediawiki-config] - https://gerrit.wikimedia.org/r/307893 (https://phabricator.wikimedia.org/T143055)
[00:59:39] (CR) 20after4: [C: 1] "As far as I can remember:" [puppet] - https://gerrit.wikimedia.org/r/296687 (https://phabricator.wikimedia.org/T139008) (owner: Ladsgroup)
[01:00:20] !log reboot db1025 for kernel update
[01:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:02:07] (CR) 20after4: "in case it isn't clear, this only affects initial deployment repo setup on the deployment hosts and it should not have any effect on the t" [puppet] - https://gerrit.wikimedia.org/r/296687 (https://phabricator.wikimedia.org/T139008) (owner: Ladsgroup)
[01:05:13] RECOVERY - check_recurring_gc_failures_missed on db1025 is OK: OK
[01:10:10] (PS2) Gergő Tisza: Do not use $wgExtensionFunctions to set globals [mediawiki-config] - https://gerrit.wikimedia.org/r/307893 (https://phabricator.wikimedia.org/T143055)
[01:24:09] (PS1) Legoktm: Enable flake8 on Python 3 [software/service-checker] - https://gerrit.wikimedia.org/r/307895
[01:24:11] (PS1) Legoktm: Run tests on Python 3.4 and 3.5 [software/service-checker] - https://gerrit.wikimedia.org/r/307896
[01:35:37] (PS9) 20after4: WIP: Scap swat command [mediawiki-config] - https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880)
[01:40:06] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=829 [critical =325]
[01:42:54] (PS1) Dzahn: Revert "archiva: migration class to rsync data to new host" [puppet] - https://gerrit.wikimedia.org/r/307900
[01:43:19] (CR) Dzahn: "all is fine, this is just the reminder to remove temp setup after migration is done" [puppet] - https://gerrit.wikimedia.org/r/307900 (owner: Dzahn)
[01:45:07] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=829 [critical =325]
[01:47:09] ACKNOWLEDGEMENT - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=829 [critical =325] Jeff_Green known problem
[01:47:23] (PS1) Dzahn: wikistats: remove $realm check, simplify role [puppet] - https://gerrit.wikimedia.org/r/307902
[01:49:26] (CR) Dzahn: [C: 2] wikistats: remove $realm check, simplify role [puppet] - https://gerrit.wikimedia.org/r/307902 (owner: Dzahn)
[02:13:45] (PS1) Dzahn: url_downloader: add IPv6 support [puppet] - https://gerrit.wikimedia.org/r/307905
[02:15:01] PROBLEM - PyBal backends health check on lvs4004 is CRITICAL: PYBAL CRITICAL - uploadlb_443 - Could not depool server cp4013.ulsfo.wmnet because of too many down!: uploadlb6_443 - Could not depool server cp4005.ulsfo.wmnet because of too many down!
[02:15:21] PROBLEM - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:16:12] PROBLEM - PyBal backends health check on lvs4002 is CRITICAL: PYBAL CRITICAL - uploadlb6_443 - Could not depool server cp4006.ulsfo.wmnet because of too many down!
[02:18:03] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[02:21:19] RECOVERY - PyBal backends health check on lvs4002 is OK: PYBAL OK - All pools are healthy
[02:22:00] (PS1) BBlack: depool ulsfo, some kind of issues... [dns] - https://gerrit.wikimedia.org/r/307906
[02:22:12] i saw the alerts, looked at cp4006 and 4007
[02:22:14] because they are in icinga
[02:22:15] (CR) BBlack: [C: 2] depool ulsfo, some kind of issues... [dns] - https://gerrit.wikimedia.org/r/307906 (owner: BBlack)
[02:22:22] they look busy but running
[02:22:28] !log ulsfo depool
[02:22:33] thanks bblack
[02:22:34] bblack: thanks was just hopping on
[02:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:22:54] thanks for paging :)
[02:22:56] it was trying to depool them but couldn't
[02:23:10] looks like pybal couldn't depool any more than it did
[02:23:13] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[02:23:33] hmm.. back ?
[02:23:33] I'm assuming the two likely underlying problems here are either the varnish4 install in ulsfo upload cluster, or the recent ulsfo link problems.
[02:23:41] but either way, depooling ulsfo fixes it for the evening
[02:23:46] ok cool
[02:24:52] PROBLEM - Varnish HTTP upload-frontend - port 3127 on cp4006 is CRITICAL: Connection timed out
[02:26:05] 26535 vcache 20 0 0.131t 0.080t 87376 S 888.3 43.5 4797:21 varnishd
[02:26:09] 27083 vcache 20 0 0.719t 0.086t 0.077t S 653.3 46.6 12498:07 varnishd
[02:26:16] those 888 and 653 numbers are %cpu
[02:26:30] loadavg on cp4006 is at like 66 right now, and I'm sure that's come down since the middle of the problem period
[02:26:34] so, likely varnish4 issues
[02:27:28] RECOVERY - Varnish HTTP upload-frontend - port 3127 on cp4006 is OK: HTTP OK: HTTP/1.1 200 OK - 327 bytes in 0.150 second response time
[02:27:48] do you restart services when it happens?
[02:28:14] it doesn't ever happen :)
[02:28:14] RECOVERY - PyBal backends health check on lvs4004 is OK: PYBAL OK - All pools are healthy
[02:28:24] RECOVERY - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 921 bytes in 1.592 second response time
[02:28:31] I'm leaving it for ema to do forensics on in the morning. none of the daemons actually crashed
[02:28:31] fair :)
[02:28:38] yep
[02:30:15] https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Upload+caches+ulsfo&m=cpu_report&s=by+name&mc=2&g=cpu_report
[02:30:26] locked up in sys%, interesting
[02:30:33] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.16) (duration: 12m 21s)
[02:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:30:52] may not be strictly v4's fault, though. we also left some systemtap stuff running overnight to debug an issue. it could be stap itself caused a problem.
[02:32:24] anyways, I assume with the depool we're stable, call/text if not
[02:33:04] yes, thanks
[02:34:54] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:42:43] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[02:46:26] PROBLEM - MD RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:46:55] PROBLEM - configured eth on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:47:16] PROBLEM - DPKG on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:47:35] PROBLEM - puppet last run on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:47:52] PROBLEM - SSH on ms-be1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:48:13] PROBLEM - HP RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[02:48:46] PROBLEM - very high load average likely xfs on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:49:25] PROBLEM - swift-container-server on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:49:25] PROBLEM - Check size of conntrack table on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:49:26] PROBLEM - swift-container-updater on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:49:45] PROBLEM - swift-account-auditor on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:49:45] PROBLEM - swift-account-replicator on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:49:45] PROBLEM - dhclient process on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:49:57] PROBLEM - swift-object-replicator on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:50:16] PROBLEM - Disk space on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:50:16] PROBLEM - salt-minion processes on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:50:29] PROBLEM - swift-object-server on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:50:38] PROBLEM - swift-account-server on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:50:38] PROBLEM - swift-account-reaper on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:51:06] PROBLEM - swift-container-auditor on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:51:07] PROBLEM - swift-object-updater on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:51:07] PROBLEM - swift-object-auditor on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:51:19] PROBLEM - swift-container-replicator on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:05:20] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.17) (duration: 17m 59s)
[03:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:12:34] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Sep 1 03:12:34 UTC 2016 (duration 7m 14s)
[03:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:35:08] (PS2) Andrew Bogott: Cloud.cfg tidying for precise and trusty images [puppet] - https://gerrit.wikimedia.org/r/307881
[03:38:02] (CR) Andrew Bogott: [C: 2] Cloud.cfg tidying for precise and trusty images [puppet] - https://gerrit.wikimedia.org/r/307881 (owner: Andrew Bogott)
[04:03:44] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[04:06:15] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5031030 keys - replication_delay is 0
[04:12:52] (PS1) Legoktm: Properly support 'basePath' [software/service-checker] - https://gerrit.wikimedia.org/r/307910
[04:12:54] (PS1) Legoktm: Fix documentation in README [software/service-checker] - https://gerrit.wikimedia.org/r/307911
[04:13:47] (CR) jenkins-bot: [V: -1] Properly support 'basePath' [software/service-checker] - https://gerrit.wikimedia.org/r/307910 (owner: Legoktm)
[04:16:05] (CR) Legoktm: [C: -1] Allow service-checker to read YAML-formatted specs (1 comment) [software/service-checker] - https://gerrit.wikimedia.org/r/306707 (https://phabricator.wikimedia.org/T136839) (owner: Mobrovac)
[04:23:34] (CR) Legoktm: "recheck" [software/service-checker] - https://gerrit.wikimedia.org/r/307910 (owner: Legoktm)
[05:00:00] (PS1) Yuvipanda: extdist: Fix typo in comments [puppet] - https://gerrit.wikimedia.org/r/307915
[05:01:02] (PS2) Yuvipanda: extdist: Fix typo in comments [puppet] - https://gerrit.wikimedia.org/r/307915
[05:01:07] (CR) Yuvipanda: [C: 2 V: 2] extdist: Fix typo in comments [puppet] - https://gerrit.wikimedia.org/r/307915 (owner: Yuvipanda)
[05:09:45] PROBLEM - NTP on ms-be1016 is CRITICAL: NTP CRITICAL: No response from NTP server
[06:20:35] so db1047 is starting to reduce its replication lag after the alter table
[06:35:20] Operations, Ops-Access-Requests: onboarding Manuel Arostegui in ops - https://phabricator.wikimedia.org/T144469#2600739 (jcrespo)
[06:51:24] Operations, Ops-Access-Requests: onboarding Manuel Arostegui in ops - https://phabricator.wikimedia.org/T144469#2600753 (jcrespo)
[06:52:03] Operations, Ops-Access-Requests: onboarding Manuel Arostegui in ops - https://phabricator.wikimedia.org/T144469#2600739 (jcrespo) I've verified though @wikimedia email that his account is Marostegui.
[07:08:34] Operations: Multiple servers in codfw fail to respond to IPMI commands during reimaging - https://phabricator.wikimedia.org/T142726#2600760 (MoritzMuehlenhoff) @Papaul, that setting fixed it, mw2148 could now be reimaged. During the reimaging we'll likely run into further servers with the same problem, so le...
[07:12:10] Operations: Multiple servers in codfw fail to respond to IPMI commands during reimaging - https://phabricator.wikimedia.org/T142726#2600761 (MoritzMuehlenhoff) @Papaul: Do the servers need to be powered down to change the setting or can you change that for running hosts as well?
[07:16:46] Operations, Ops-Access-Requests: onboarding Manuel Arostegui in ops - https://phabricator.wikimedia.org/T144469#2600762 (jcrespo)
[07:22:15] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active, AS1299/IPv4: Active
[07:23:52] Operations, Ops-Access-Requests: Grant Manuel root on WMF cluster - https://phabricator.wikimedia.org/T144470#2600771 (jcrespo)
[07:26:23] Operations, Discovery, Maps, WMF-Legal, Maps-Sprint: Define tile usage policy - https://phabricator.wikimedia.org/T141815#2513156 (timautin) Hi, I'm the developer of the app E-walk: https://play.google.com/store/apps/details?id=com.at.ewalk.free I'd definitively be interested if we could u...
[07:29:20] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 22 probes of 426 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map
[07:29:44] Operations, Ops-Access-Requests: onboarding Manuel Arostegui in ops - https://phabricator.wikimedia.org/T144469#2600805 (jcrespo)
[07:31:11] Operations, Ops-Access-Requests: onboarding Manuel Arostegui in ops - https://phabricator.wikimedia.org/T144469#2600812 (Marostegui) cloak requested
[07:35:49] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 2 probes of 426 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map
[07:45:53] !log reimaging mw2116-2119 to jessie
[07:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:46:50] Operations, ops-eqiad: Broken disk on copper - https://phabricator.wikimedia.org/T144261#2593788 (elukey) From the dmesg: ``` [12690812.776141] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 [12690812.782847] ata2.00: BMDMA stat 0x24 [12690812.786685] ata2.00: failed command: WRITE DMA [12690...
[07:47:02] Operations, ops-eqiad: Broken disk on copper - https://phabricator.wikimedia.org/T144261#2600835 (elukey) p:Triage>High
[07:52:12] Operations, Ops-Access-Requests: Grant Manuel root on WMF cluster - https://phabricator.wikimedia.org/T144470#2600845 (Marostegui) I have read and signed https://phabricator.wikimedia.org/L3 "You signed this document on Thu, Sep 1, 09:51."
[07:57:29] (CR) Alexandros Kosiaris: [C: 2] url_downloader: add IPv6 support [puppet] - https://gerrit.wikimedia.org/r/307905 (owner: Dzahn)
[07:57:34] (PS2) Alexandros Kosiaris: url_downloader: add IPv6 support [puppet] - https://gerrit.wikimedia.org/r/307905 (owner: Dzahn)
[07:57:40] (CR) Alexandros Kosiaris: [V: 2] url_downloader: add IPv6 support [puppet] - https://gerrit.wikimedia.org/r/307905 (owner: Dzahn)
[07:59:46] RECOVERY - BGP status on cr1-ulsfo is OK: BGP OK - up: 16, down: 1, shutdown: 0
[08:04:34] (PS1) Muehlenhoff: Remove decommissioned host rubidium from site.pp [puppet] - https://gerrit.wikimedia.org/r/307917 (https://phabricator.wikimedia.org/T118213)
[08:06:05] PROBLEM - DPKG on mw2148 is CRITICAL: Connection refused by host
[08:08:46] RECOVERY - DPKG on mw2148 is OK: All packages OK
[08:09:42] (CR) Ladsgroup: [C: 1] "So it's okay to merge! One of ops should delete the old repo in tin after this merge I guess." [puppet] - https://gerrit.wikimedia.org/r/296687 (https://phabricator.wikimedia.org/T139008) (owner: Ladsgroup)
[08:14:37] (PS1) Elukey: Remove g+w flag to /srv/log/eventlogging to allow proper logrotation [puppet] - https://gerrit.wikimedia.org/r/307918 (https://phabricator.wikimedia.org/T132324)
[08:19:01] Operations, Ops-Access-Requests: Grant Manuel root on WMF cluster - https://phabricator.wikimedia.org/T144470#2600771 (mark) Approved, proceed.
[08:37:24] Operations, Discovery, Discovery-Search, Elasticsearch, Wikimedia-log-errors: Elastica warning about Retrying connection to search.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T144450#2600908 (hashar) Erik mentioned `elastic1028` had puppet disabled and elasticsearch disabled at 18...
[08:40:15] Operations, Discovery, Discovery-Search, Elasticsearch, Wikimedia-log-errors: Elastica warning about Retrying connection to search.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T144450#2600910 (hashar) From SAL: **2016-08-31** ``` lang=irc 18:54 shutting down elasticsearch...
[08:42:03] Operations, Discovery, Discovery-Search, Elasticsearch, Wikimedia-log-errors: Elastica warning about Retrying connection to search.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T144450#2600914 (hashar) p:Triage>Normal
[08:48:26] Operations, Ops-Access-Requests, Research-and-Data, Research-collaborations, Research-management: Request access to data for WDQS research - https://phabricator.wikimedia.org/T142780#2600924 (AlexKrauseTUD) @Cmjohnson As said on Monday, 29th of August: AlexKrauseTUD or is it another account w...
[08:48:51] Operations, Discovery, Discovery-Search, Elasticsearch, Wikimedia-log-errors: Elastica warning about Retrying connection to search.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T144450#2600925 (hashar) ETCD config is at https://config-master.wikimedia.org/conftool/eqiad/search and...
[08:54:54] (PS1) Hashar: noc: link to conftool and wikitech pages [mediawiki-config] - https://gerrit.wikimedia.org/r/307920
[08:56:18] (CR) Hashar: "That is to have https://noc.wikimedia.org/ to refer to the conftool config files." [mediawiki-config] - https://gerrit.wikimedia.org/r/307920 (owner: Hashar)
[09:04:29] !log reboot ms-be1016, stuck and nothing on console
[09:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:06:08] !log installing postgres security updates on labsdb1004/1006/1007
[09:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:07:34] RECOVERY - swift-account-server on ms-be1016 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[09:07:56] RECOVERY - Check size of conntrack table on ms-be1016 is OK: OK: nf_conntrack is 4 % full
[09:07:57] RECOVERY - swift-container-auditor on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[09:08:05] RECOVERY - swift-container-replicator on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[09:08:05] RECOVERY - Disk space on ms-be1016 is OK: DISK OK
[09:08:15] RECOVERY - very high load average likely xfs on ms-be1016 is OK: OK - load average: 8.87, 2.64, 0.91
[09:08:15] RECOVERY - dhclient process on ms-be1016 is OK: PROCS OK: 0 processes with command name dhclient
[09:08:25] RECOVERY - swift-container-server on ms-be1016 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[09:08:35] RECOVERY - DPKG on ms-be1016 is OK: All packages OK
[09:08:37] RECOVERY - HP RAID on ms-be1016 is OK: OK: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2, Controller
[09:08:54] (PS1) Marostegui: Added Manuel Arostegui (marostegui) to the Ops group [puppet] - https://gerrit.wikimedia.org/r/307921
[09:10:00] (CR) jenkins-bot: [V: -1] Added Manuel Arostegui (marostegui) to the Ops group [puppet] - https://gerrit.wikimedia.org/r/307921 (owner: Marostegui)
[09:18:43] (PS1) Marostegui: Removed tab and added spaces [puppet] - https://gerrit.wikimedia.org/r/307922
[09:20:58] (PS2) Hashar: noc: link to conftool and wikitech pages [mediawiki-config] - https://gerrit.wikimedia.org/r/307920
[09:21:06] Operations, Discovery, Discovery-Search, Elasticsearch, Wikimedia-log-errors: Elastica warning about Retrying connection to search.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T144450#2600972 (Gehel) >>! In T144450#2600908, @hashar wrote: > Erik mentioned `elastic1028` had puppet...
[09:22:29] (PS3) Hashar: noc: link to conftool and wikitech pages [mediawiki-config] - https://gerrit.wikimedia.org/r/307920
[09:23:14] (CR) Filippo Giunchedi: [C: 1] "LGTM, fixed capitalization/indentation" [mediawiki-config] - https://gerrit.wikimedia.org/r/307920 (owner: Hashar)
[09:24:49] !log reimaging mw2163-2166 to jessie
[09:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:26:15] (CR) Filippo Giunchedi: "LGTM from a quick glance, did you run it through the compiler?" [puppet] - https://gerrit.wikimedia.org/r/307510 (https://phabricator.wikimedia.org/T144043) (owner: Muehlenhoff)
[09:28:30] (Abandoned) Jcrespo: Added Manuel Arostegui (marostegui) to the Ops group [puppet] - https://gerrit.wikimedia.org/r/307921 (owner: Marostegui)
[09:29:36] Operations, Discovery, Discovery-Search, Elasticsearch, Wikimedia-log-errors: Elastica warning about Retrying connection to search.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T144450#2600975 (hashar) Sorry I have been misleading. That link https://config-master.wikimedia.org/pyb...
[09:35:08] !log reimaging mw2167 -> mw2170 to Jessie
[09:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:37:22] !log reimaging mw2061-2064 to jessie
[09:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:38:31] (PS2) Jcrespo: Add Manuel Arostegui (marostegui) cluster access with ops rights [puppet] - https://gerrit.wikimedia.org/r/307922 (https://phabricator.wikimedia.org/T144470) (owner: Marostegui)
[09:42:21] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[09:44:21] Operations: Multiple servers in codfw fail to respond to IPMI commands during reimaging - https://phabricator.wikimedia.org/T142726#2600990 (MoritzMuehlenhoff) @Papaul, @Cmjohnson : I guess you have some kind of checklist for racking new hardware? Could you add an item to check that new servers always have "...
[09:44:58] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[09:51:05] !log reimaging mw2200-2203 to jessie
[09:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:54:44] Operations, Performance-Team, Thumbor, Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2601046 (fgiunchedi)
[09:55:35] (CR) Muehlenhoff: [C: 1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/307922 (https://phabricator.wikimedia.org/T144470) (owner: Marostegui)
[09:56:49] !log gehel@palladium conftool action : set/pooled=inactive; selector: elastic1044.eqiad.wmnet
[09:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:57:33] !log gehel@palladium conftool action : set/pooled=inactive; selector: elastic1045.eqiad.wmnet
[09:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:57:40] !log gehel@palladium conftool action : set/pooled=inactive; selector: elastic1046.eqiad.wmnet
[09:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:58:33] !log adding marostegui to wmf and ops on wikitech LDAP
[09:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:58:40] Operations, MediaWiki-Maintenance-scripts, Performance-Team, Thumbor: ensure thumbor container access is preserved by mw filebackend setzoneaccess - https://phabricator.wikimedia.org/T144479#2601048 (fgiunchedi)
[10:03:36] I am sorry, but every time I touch the LDAP, the procedure has changed :-/
[10:03:51] good news is that it makes sense now
[10:10:48] Operations, Ops-Access-Requests: onboarding Manuel Arostegui in ops - https://phabricator.wikimedia.org/T144469#2601067 (jcrespo)
[10:11:09] (CR) Muehlenhoff: "Yes, PCC output is at http://puppet-compiler.wmflabs.org/3897/" [puppet] - https://gerrit.wikimedia.org/r/307510 (https://phabricator.wikimedia.org/T144043) (owner: Muehlenhoff)
[10:13:24] (CR) Jcrespo: [C: 2] Add Manuel Arostegui (marostegui) cluster access with ops rights [puppet] - https://gerrit.wikimedia.org/r/307922 (https://phabricator.wikimedia.org/T144470) (owner: Marostegui)
[10:15:13] PROBLEM - MariaDB Slave Lag: s1 on db1072 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 796.63 seconds
[10:15:38] and so it begins ;p
[10:15:44] db1072?
[10:16:02] no, I think it is a slow slave only, the one where we ca
[10:16:12] it is, and I'm only doing half of the stubs at a time
[10:16:13] *n get false positives
[10:16:16] so
[10:16:29] do not assume yet it is the dumps
[10:16:30] i.e. 14 jobs running
[10:17:12] !log repooled elastic104[456] - T144450
[10:17:12] just paranoid...
[10:17:13] T144450: Elastica warning about Retrying connection to search.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T144450
[10:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:17:34] but I am not going to lie, it looks like it
[10:17:57] but we have already done a month's, well two month's run with 14 stubs going and it's been fine
[10:18:05] so it must be interaction with something else at the same time
[10:18:12] it is ok, as I said
[10:18:37] it is the vslow, dumps slave
[10:18:57] sadly, it is not easy to make icinga aware of mediawiki configuration dynamically
[10:19:06] Operations, Discovery, Discovery-Search, Elasticsearch, Wikimedia-log-errors: Elastica warning about Retrying connection to search.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T144450#2601086 (Gehel) Open>Resolved a:Gehel Issue was related to PyBal not re-doing DNS res...
[10:19:13] maybe it is worth trying to figure out exactly what combination of queries is responsible
[10:19:31] so we will just ack it, hope we have a quick solution with no pages next month
[10:19:41] apergos, sure
[10:19:59] we know the stubs when they run later in the month don't cause this
[10:19:59] it is just that we should not be too concerned about it
[10:20:02] right
[10:20:14] I just hate it because suppose something actually breaks
[10:20:22] (PS4) Hashar: noc: link to conftool and wikitech pages [mediawiki-config] - https://gerrit.wikimedia.org/r/307920
[10:20:32] and we won't see it because 'well there's always this problem at the beginning of the month'
[10:20:40] apergos, we will have to figure out a way
[10:20:53] godog: thank you for the noc.wm.o review. I have sent another tiny indentation fix https://gerrit.wikimedia.org/r/#/c/307920/3..4/docroot/noc/index.html
[10:20:59] well if you open a new ticket, add me to it please
[10:21:00] sigh
[10:21:02] godog: if that is a +1 for you I am going to deploy it
[10:21:21] would save me from looking at outdated pybal files ]
[10:21:27] (CR) Filippo Giunchedi: [C: 1] noc: link to conftool and wikitech pages [mediawiki-config] - https://gerrit.wikimedia.org/r/307920 (owner: Hashar)
[10:21:34] hashar: yup, LGTM thanks!
[10:21:41] hmm I just got the page... quite delayed
[10:21:54] (CR) Hashar: [C: 2] noc: link to conftool and wikitech pages [mediawiki-config] - https://gerrit.wikimedia.org/r/307920 (owner: Hashar)
[10:22:22] (Merged) jenkins-bot: noc: link to conftool and wikitech pages [mediawiki-config] - https://gerrit.wikimedia.org/r/307920 (owner: Hashar)
[10:22:55] apergos, I think there was some collation script running
[10:23:14] it just happens that this server is very loaded and mediawiki doesn't wait for it
[10:23:15] oh that's right, isn't this the uc change?
[10:23:45] I just need the glue between pooling status/tags and icinga
[10:24:08] which obviously the good way of doing it is etcd
[10:24:11] !log hashar@tin Synchronized docroot/noc/index.html: link to conftool and wikitech pages on https://noc.wikimedia.org/ (duration: 00m 47s)
[10:24:15] but we are not yet there
[10:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:24:38] I accept suggestions for a temporary patch
[10:25:25] I actually don't mind seeing the false positives as long as we can tell pretty quickly that they are indeed ok to ignore
[10:25:31] or not
[10:25:58] I don't have good thoughts on that right now but I will keep it in the back of my mind
[10:26:08] well, it would be fine, if it wasn't because it pages
[10:26:19] yep
[10:26:41] because I have yet no way to discern from a "slow slave" and a "SPOF slave"
[10:27:10] well in our case they are the same thing, sadly
[10:27:16] for now
[10:27:34] apergos, while I appreciate dumps slaves
[10:27:44] these will not cause immediate issues to users
[10:27:46] no
[10:27:52] it's the rest of the stuff run over there though
[10:28:15] no, precisely, we leave there only things like terbium jobs
[10:28:27] I thought there was some job queue related stuff that ran on the vslows
[10:28:28] no?
[10:28:46] no, jobs are executed on all slaves, load balanced [10:29:02] ok, maybe it was just discussion about moving one type of job off [10:29:05] I only added a single job temporarily, as a desperate measure [10:29:10] ah that was it then [10:29:11] yes, I want that [10:29:20] but that will be on regular slaves [10:30:02] it's complicated; the roles were a good idea, but makes things more complex [10:30:11] funny how that works [10:30:25] *funny that works [10:30:37] :-D [10:31:06] so I will check later if this is a 1-time thing because of the script [10:31:42] and try to think a way for the selective incinga checks [10:32:53] I'm interested in the first half of that, for sure [10:33:09] I can probably dial back the number of pages per stub I request at once [10:33:12] if we need that [10:33:20] so each query would be shorter [10:34:10] I think it is combined with the bad cirrus queries [10:34:21] apergos, check https://tendril.wikimedia.org/report/slow_queries?host=%5Edb1072&user=&schema=&qmode=eq&query=&hours=1 [10:35:01] so, yes, I caused a page [10:35:14] but on the other side, I think I prevented a production outage [10:35:34] apergos, sorry you are affected by this, I had to chose the least of 2 evils [10:35:47] I don't mind the page [10:36:01] I just mind the idea that my dumps are dragging down the host in combination with other things [10:36:06] so that's what I really want to fix [10:36:08] and technically, dumps shouldn't be affected by lag [10:36:35] I'm looking at the tendril page but the cirrus queries all seem to be pretty short lived in comparison [10:37:11] this is one of those cases where I don't know enough to interpret based on the data [10:37:33] yes, but very frequent 7 second queries will overload the server easily [10:38:40] PROBLEM - puppet last run on sca2001 is CRITICAL: CRITICAL: Puppet has 1 failures [10:38:40] I guess the flaggedrevsstats must be some update job? 
[10:39:19] RECOVERY - swift-account-replicator on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [10:39:19] RECOVERY - swift-object-server on ms-be1016 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [10:39:19] RECOVERY - MD RAID on ms-be1016 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [10:39:39] RECOVERY - swift-object-updater on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [10:39:45] !log reboot ms-be1016, stuck again [10:39:49] RECOVERY - swift-account-auditor on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [10:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:40:07] (03PS1) 10Niharika29: Test PageAssessments on English Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307927 (https://phabricator.wikimedia.org/T142056) [10:40:09] RECOVERY - SSH on ms-be1016 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0) [10:40:09] apergos, yes, probably a terbium maintenace cron [10:40:38] I thought those ran during the 15-20th. 
or maybe those were some other set of special pages [10:40:39] grrr [10:40:50] RECOVERY - swift-container-updater on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [10:41:00] RECOVERY - salt-minion processes on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:41:00] RECOVERY - configured eth on ms-be1016 is OK: OK - interfaces up [10:41:20] RECOVERY - swift-object-replicator on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [10:41:33] RECOVERY - swift-object-auditor on ms-be1016 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [10:41:39] RECOVERY - swift-account-reaper on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [10:42:41] well the flaggedrevs script runs every two hours [10:42:44] presumably that's not it then [10:43:43] or I mean not a contributing factor. I guess I'll look into making the stubs queries shorter [10:45:01] RECOVERY - NTP on ms-be1016 is OK: NTP OK: Offset -0.001034259796 secs [10:49:59] !log installing libidn security updates [10:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:50:23] 06Operations, 06Discovery, 06Maps, 06WMF-Legal, 03Maps-Sprint: Define tile usage policy - https://phabricator.wikimedia.org/T141815#2513156 (10Pnorman) > bulk download (select an area of the map, a min and a max zoom levels, and downloading each tiles one after one) Bulk downloading high zoom image tile... 
[10:53:49] PROBLEM - puppet last run on sca2002 is CRITICAL: CRITICAL: Puppet has 1 failures [11:00:13] 06Operations, 10Ops-Access-Requests: onboarding Manuel Arostegui in ops - https://phabricator.wikimedia.org/T144469#2601152 (10jcrespo) [11:00:14] 06Operations, 10Ops-Access-Requests: Grant Manuel root on WMF cluster - https://phabricator.wikimedia.org/T144470#2601149 (10jcrespo) 05Open>03Resolved a:03jcrespo Marostegui has sudo and has confirmed me he has access to the cluster and mysql. [11:01:01] PROBLEM - puppet last run on sca1001 is CRITICAL: CRITICAL: Puppet has 1 failures [11:01:21] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 07Wikimedia-log-errors: Elastica warning about Retrying connection to search.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T144450#2601153 (10hashar) The spam in logstash is gone :] Well done ops! [11:01:42] PROBLEM - puppet last run on sca1002 is CRITICAL: CRITICAL: Puppet has 1 failures [11:02:14] 06Operations, 10Ops-Access-Requests: onboarding Manuel Arostegui in ops - https://phabricator.wikimedia.org/T144469#2601154 (10jcrespo) [11:04:54] elukey: hello, was it you that proposed to push the zuul.Deb package to apt.wm.o ? [11:05:43] the .changes lacked the orig.tar.gz checksum. And looks like the latest version I have build is still missing it :( [11:05:49] eg https://people.wikimedia.org/~hashar/debs/zuul_2.5.0-8-gcbc7f62-wmf2precise1/zuul_2.5.0-8-gcbc7f62-wmf2precise1_amd64.changes [11:09:16] 06Operations, 06Community-Tech, 10wikidiff2, 13Patch-For-Review: Deploy new version of wikidiff2 package - https://phabricator.wikimedia.org/T140443#2601171 (10MoritzMuehlenhoff) >>! In T140443#2599180, @Legoktm wrote: > Hmm, what about wikitech/silver which is still using PHP5 on trusty? The old version... 
[11:09:55] I think moritzm mentioned it IIRC but I can help [11:09:56] :) [11:10:27] (03PS1) 10Jcrespo: icinga: add Manuel Arostegui to icinga privileged accounts [puppet] - 10https://gerrit.wikimedia.org/r/307928 (https://phabricator.wikimedia.org/T144469) [11:11:56] (03CR) 10Jcrespo: [C: 04-1] "Not yet on the private repo." [puppet] - 10https://gerrit.wikimedia.org/r/307928 (https://phabricator.wikimedia.org/T144469) (owner: 10Jcrespo) [11:16:07] there is something with passing -sa apparently [11:16:22] but cant figure out how git buildpackage does or does not pass it [11:19:44] (03PS1) 10Filippo Giunchedi: site: add prometheus node_exporter to more codfw machines [puppet] - 10https://gerrit.wikimedia.org/r/307929 (https://phabricator.wikimedia.org/T140646) [11:22:14] (03CR) 10Filippo Giunchedi: "Adding respective service owners for notification, we haven't seen any adverse impact on DB hosts in codfw/eqiad though." [puppet] - 10https://gerrit.wikimedia.org/r/307929 (https://phabricator.wikimedia.org/T140646) (owner: 10Filippo Giunchedi) [11:23:40] (03PS2) 10Jcrespo: icinga: add Manuel Arostegui to icinga privileged accounts [puppet] - 10https://gerrit.wikimedia.org/r/307928 (https://phabricator.wikimedia.org/T144469) [11:26:01] RECOVERY - puppet last run on ms-be1016 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [11:28:44] (03PS1) 10Cmjohnson: Relocating elastic1028 from row D to row B. Changing DNS entries to matc:wqh [dns] - 10https://gerrit.wikimedia.org/r/307930 [11:29:34] (03CR) 10Cmjohnson: [C: 032] Relocating elastic1028 from row D to row B. 
Changing DNS entries to matc:wqh [dns] - 10https://gerrit.wikimedia.org/r/307930 (owner: 10Cmjohnson) [11:33:31] (03PS1) 10Muehlenhoff: Bump the ABI of our kernel package to 2 [debs/linux44] - 10https://gerrit.wikimedia.org/r/307931 [11:34:10] (03PS1) 10Gilles: Configure Thumbor to use statsd for metrics [puppet] - 10https://gerrit.wikimedia.org/r/307932 (https://phabricator.wikimedia.org/T144478) [11:36:00] log depooling elastic102[1289] - T143685 [11:36:00] T143685: Improve balance of nodes across rows for elasticsearch cluster eqiad - https://phabricator.wikimedia.org/T143685 [11:38:25] (03CR) 10Filippo Giunchedi: [C: 031] keyholder-proxy/agent: Convert to base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/307510 (https://phabricator.wikimedia.org/T144043) (owner: 10Muehlenhoff) [11:46:08] (03CR) 10Filippo Giunchedi: [C: 032] Configure Thumbor to use statsd for metrics [puppet] - 10https://gerrit.wikimedia.org/r/307932 (https://phabricator.wikimedia.org/T144478) (owner: 10Gilles) [11:47:25] (03CR) 10Muehlenhoff: [C: 032] Bump the ABI of our kernel package to 2 [debs/linux44] - 10https://gerrit.wikimedia.org/r/307931 (owner: 10Muehlenhoff) [11:49:05] (03CR) 10Daniel Kinzler: "Now that Icf71cdb7 is merged, is this still needed?" 
[puppet] - 10https://gerrit.wikimedia.org/r/256663 (https://phabricator.wikimedia.org/T75997) (owner: 10Thiemo Mättig (WMDE)) [11:50:21] (03CR) 10Marostegui: [C: 032] icinga: add Manuel Arostegui to icinga privileged accounts [puppet] - 10https://gerrit.wikimedia.org/r/307928 (https://phabricator.wikimedia.org/T144469) (owner: 10Jcrespo) [11:50:26] (03PS1) 10Alexandros Kosiaris: Partially Revert "Clarify string in weekly Phabricator Project email" [puppet] - 10https://gerrit.wikimedia.org/r/307935 [11:51:01] (03PS2) 10Alexandros Kosiaris: Partially Revert "Clarify string in weekly Phabricator Project email" [puppet] - 10https://gerrit.wikimedia.org/r/307935 [11:51:07] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Partially Revert "Clarify string in weekly Phabricator Project email" [puppet] - 10https://gerrit.wikimedia.org/r/307935 (owner: 10Alexandros Kosiaris) [11:51:26] (03PS2) 10Filippo Giunchedi: site: add prometheus node_exporter to more codfw machines [puppet] - 10https://gerrit.wikimedia.org/r/307929 (https://phabricator.wikimedia.org/T140646) [11:53:46] (03CR) 10Jcrespo: [C: 031] icinga: add Manuel Arostegui to icinga privileged accounts [puppet] - 10https://gerrit.wikimedia.org/r/307928 (https://phabricator.wikimedia.org/T144469) (owner: 10Jcrespo) [11:56:17] (03PS3) 10Marostegui: icinga: add Manuel Arostegui to icinga privileged accounts [puppet] - 10https://gerrit.wikimedia.org/r/307928 (https://phabricator.wikimedia.org/T144469) (owner: 10Jcrespo) [12:04:58] (03PS1) 10Cmjohnson: Relocating elastic1021 and 1022 to row C, corresponding dns changes [dns] - 10https://gerrit.wikimedia.org/r/307936 [12:05:37] (03CR) 10Cmjohnson: [C: 032] Relocating elastic1021 and 1022 to row C, corresponding dns changes [dns] - 10https://gerrit.wikimedia.org/r/307936 (owner: 10Cmjohnson) [12:12:32] !log rolling restart of ferm on elasticsearch eqiad cluster to account for moved servers - T143685 [12:12:33] T143685: Improve balance of nodes across rows for elasticsearch 
cluster eqiad - https://phabricator.wikimedia.org/T143685 [12:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:24:54] hello I need an admin at wikitechwiki [12:24:58] vandalism in progress [12:26:22] Dereckson: ping, you admin at wikitech? [12:28:53] (03CR) 10Hashar: [C: 031] "We have tested on beta cluster mira and deployment-tin instances (both Trusty) by:" [puppet] - 10https://gerrit.wikimedia.org/r/307510 (https://phabricator.wikimedia.org/T144043) (owner: 10Muehlenhoff) [12:32:03] AaronSchulz: ping [12:46:17] (03PS4) 10Gehel: elasticsearch - disable cron to clear elasticsearch caches [puppet] - 10https://gerrit.wikimedia.org/r/307779 (https://phabricator.wikimedia.org/T144396) [12:46:17] mafk: hmm? [12:46:39] AaronSchulz: there's a vandal in wikitechwiki, could you please block it? [12:47:02] https://wikitech.wikimedia.org/wiki/Special:Contributions/Adam_Hilter [12:47:40] I don't nominally have admin there [12:48:35] (03CR) 10Gehel: [C: 032] elasticsearch - disable cron to clear elasticsearch caches [puppet] - 10https://gerrit.wikimedia.org/r/307779 (https://phabricator.wikimedia.org/T144396) (owner: 10Gehel) [12:48:46] ah, well, okay - I though you had [12:48:51] sorry [12:49:52] https://wikitech.wikimedia.org/w/index.php?title=Special%3AListUsers&username=&group=sysop&limit=50 [12:50:01] legoktm is probably the easiest to poke [12:50:25] (03CR) 10Gehel: [C: 032] elastic102[1289] moved to new racks [puppet] - 10https://gerrit.wikimedia.org/r/307733 (https://phabricator.wikimedia.org/T143685) (owner: 10Gehel) [12:50:32] (03PS2) 10Gehel: elastic102[1289] moved to new racks [puppet] - 10https://gerrit.wikimedia.org/r/307733 (https://phabricator.wikimedia.org/T143685) [12:51:23] or WikiSysop :P [12:52:30] it doesn't seem urgent atm [12:52:51] nope [12:52:55] if it was I could op myself with SQL ;) [12:53:00] lol [12:53:22] userrights at meta doesn't work (as it should be fwiw) [12:53:55] !log ema@palladium conftool 
action : set/pooled=yes; selector: cp4005.ulsfo.wmnet (tags: ['dc=ulsfo', 'cluster=cache_upload', 'service=varnish-be']) [12:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:00:04] hashar, Dereckson, addshore, and aude: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160901T1300). Please do the needful. [13:00:52] hashar, Dereckson: no eu swat today, right? https://wikitech.wikimedia.org/wiki/Deployments#Thursday.2C.C2.A0September.C2.A001 [13:01:35] !log gehel@palladium conftool action : set/pooled=yes; selector: name=elastic1047.eqiad.wmnet [13:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:06:21] zeljkof: right, nothing to deploy [13:07:10] until next week then :) [13:08:43] (03PS5) 10Muehlenhoff: keyholder-proxy/agent: Convert to base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/307510 (https://phabricator.wikimedia.org/T144043) [13:13:27] (03CR) 10Muehlenhoff: [C: 032] keyholder-proxy/agent: Convert to base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/307510 (https://phabricator.wikimedia.org/T144043) (owner: 10Muehlenhoff) [13:13:41] (03PS1) 10BBlack: package::builder: EXPERIMENTAL=yes support [puppet] - 10https://gerrit.wikimedia.org/r/307943 [13:16:01] !log upgrading httpd/apache to 2.4.10-10+deb8u6+wmf2 on mw130[01].eqiad.wmnet [13:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:18:20] (03PS1) 10Cmjohnson: relocating elastic1029, corresponding dns change [dns] - 10https://gerrit.wikimedia.org/r/307944 [13:18:45] (03CR) 10Cmjohnson: [C: 032] relocating elastic1029, corresponding dns change [dns] - 10https://gerrit.wikimedia.org/r/307944 (owner: 10Cmjohnson) [13:22:05] an snap these are jobrunners [13:22:22] PROBLEM - Keyholder SSH agent on mira is CRITICAL: CRITICAL: Keyholder is not armed. 
Run keyholder arm to arm it. [13:24:10] mm any reason why this is happening --^ ? [13:24:40] (03CR) 10Gehel: maps - use new tileshell.js script to notify tilerator of expired tiles (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/307787 (https://phabricator.wikimedia.org/T139451) (owner: 10Gehel) [13:25:55] elukey: you mean the keyholder? that's fixed and a result of my patch https://gerrit.wikimedia.org/r/307510 [13:26:12] after a reboot/restart of the keyholder service, the keys need to be readded [13:26:35] super missed your patch [13:26:43] I was wondering if it was on purpose or not [13:26:44] thanks :) [13:27:23] RECOVERY - Keyholder SSH agent on mira is OK: OK: Keyholder is armed with all configured keys. [13:29:48] moritzm: in order to "depool" a jobrunner I'd need to stop jobchron right? conftool does not help a lot in this use case [13:30:15] !log hashar@tin Synchronized php-1.28.0-wmf.17/includes/page/WikiPage.php: T144484 (duration: 00m 35s) [13:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:30:43] 13:30:15 336 apaches had sync errors [13:31:17] wow [13:31:57] yeah that is the keyholder that got stopped just when I have hit scap sync-file [13:32:13] moritzm is moving it to base::service_unit and rearming the keys [13:32:21] super luck [13:32:22] :) [13:33:59] hashar: now rearmed [13:34:22] moritzm: all good on tin.eqiad.wmnet. Well done! 
[13:34:23] !log hashar@tin Synchronized php-1.28.0-wmf.17/includes/page/WikiPage.php: T144484 (duration: 00m 49s) [13:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:34:27] elukey: I think so, I've simply always upgraded jobrunners in small batches [13:34:31] that validates the transition to base::service_unit [13:34:33] (03CR) 10Ottomata: [C: 031] Remove g+w flag to /srv/log/eventlogging to allow proper logrotation [puppet] - 10https://gerrit.wikimedia.org/r/307918 (https://phabricator.wikimedia.org/T132324) (owner: 10Elukey) [13:36:36] mmmm doesn't seem to stop the POST to hhvm [13:37:34] ah no that's the jobrunner's job [13:37:44] It is a separate entity [13:39:14] !log not upgrading mw130[01] since I'd need more info before proceeding [13:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:39:27] I need to study them a bit more [13:39:32] will proceed with API only [13:41:16] !log uploaded openssl-1.1.0-1+wmf1 to jessie-wikimedia/experimental [13:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:42:13] !log upgrading httpd/apache to 2.4.10-10+deb8u6+wmf2 on mw128[78] [13:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:50:04] (03PS2) 10Elukey: Remove g+w flag to /srv/log/eventlogging to allow proper logrotation [puppet] - 10https://gerrit.wikimedia.org/r/307918 (https://phabricator.wikimedia.org/T132324) [13:51:12] (03CR) 10Elukey: [C: 032] Remove g+w flag to /srv/log/eventlogging to allow proper logrotation [puppet] - 10https://gerrit.wikimedia.org/r/307918 (https://phabricator.wikimedia.org/T132324) (owner: 10Elukey) [14:00:56] (03PS11) 10Thiemo Mättig (WMDE): Rewrite outdated comment about Gerrit-Phabricator linking [puppet] - 10https://gerrit.wikimedia.org/r/256663 (https://phabricator.wikimedia.org/T75997) [14:02:58] (03CR) 10Yurik: [C: 031] maps - use new tileshell.js script to 
notify tilerator of expired tiles [puppet] - 10https://gerrit.wikimedia.org/r/307787 (https://phabricator.wikimedia.org/T139451) (owner: 10Gehel) [14:03:57] (03CR) 10Paladox: [C: 031] Rewrite outdated comment about Gerrit-Phabricator linking [puppet] - 10https://gerrit.wikimedia.org/r/256663 (https://phabricator.wikimedia.org/T75997) (owner: 10Thiemo Mättig (WMDE)) [14:04:59] (03PS2) 10BBlack: package::builder: EXPERIMENTAL=yes support [puppet] - 10https://gerrit.wikimedia.org/r/307943 [14:11:30] !log wipe and reinitialize corrupted xfs on /dev/sdn1 on ms-be1016 [14:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:14:32] 06Operations: Multiple servers in codfw fail to respond to IPMI commands during reimaging - https://phabricator.wikimedia.org/T142726#2601432 (10Papaul) @MoritzMuehlenhoff I think we need to power down the hosts. For also to be on the safe side to make sure that the settings are saved a applied by the hosts I re... [14:25:22] (03CR) 10Filippo Giunchedi: package::builder: EXPERIMENTAL=yes support (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/307943 (owner: 10BBlack) [14:27:28] !log restbase deploy start of 9cca320 [14:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:34:27] RECOVERY - puppet last run on sca1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:34:56] 06Operations: Blocked /etc/passwd on sca100[1234] hosts - https://phabricator.wikimedia.org/T144492#2601459 (10jcrespo) [14:37:55] (03CR) 10Filippo Giunchedi: "If I understand correctly what would be stored either on swift or local FS would be math renderings, correct? 
IOW if we lose silver's FS t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307494 (https://phabricator.wikimedia.org/T126338) (owner: 10Dereckson) [14:39:16] !log upgrading httpd/apache to 2.4.10-10+deb8u6+wmf2 on mw128[56] [14:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:39:34] 06Operations: Blocked /etc/passwd on sca100[1234] hosts - https://phabricator.wikimedia.org/T144492#2601480 (10jcrespo) Debugging on IRC: ``` so this time around it's /etc/passwd that's locked not /etc/passwd+ https://github.com/netblue30/firejail/issues/559 ? !log powered down several hosts for hardware maintenance (T142726): mw2087, mw2149-mw2151 [14:40:44] T142726: Multiple servers in codfw fail to respond to IPMI commands during reimaging - https://phabricator.wikimedia.org/T142726 [14:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:42:25] ACKNOWLEDGEMENT - puppet last run on sca1002 is CRITICAL: CRITICAL: Puppet has 1 failures Jcrespo Firejail/kernel issue: https://phabricator.wikimedia.org/T144492 [14:42:25] ACKNOWLEDGEMENT - puppet last run on sca2001 is CRITICAL: CRITICAL: Puppet has 1 failures Jcrespo Firejail/kernel issue: https://phabricator.wikimedia.org/T144492 [14:42:25] ACKNOWLEDGEMENT - puppet last run on sca2002 is CRITICAL: CRITICAL: Puppet has 1 failures Jcrespo Firejail/kernel issue: https://phabricator.wikimedia.org/T144492 [14:42:34] !log restbase deploy end of 9cca320 [14:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:42:39] RECOVERY - puppet last run on sca1002 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [14:42:55] oh? [14:43:37] did someone workaround it? 
[14:45:30] (03PS3) 10Gehel: maps - use new tileshell.js script to notify tilerator of expired tiles [puppet] - 10https://gerrit.wikimedia.org/r/307787 (https://phabricator.wikimedia.org/T139451) [14:47:10] (03PS3) 10Dereckson: Enable Math on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307494 (https://phabricator.wikimedia.org/T126338) [14:47:28] RECOVERY - puppet last run on sca2001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [14:47:49] (03CR) 10Dereckson: "PS3: switched Math file storage to local-backend." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307494 (https://phabricator.wikimedia.org/T126338) (owner: 10Dereckson) [14:52:08] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: onboarding Manuel Arostegui in ops - https://phabricator.wikimedia.org/T144469#2601512 (10jcrespo) [14:52:25] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: onboarding Manuel Arostegui in ops - https://phabricator.wikimedia.org/T144469#2600739 (10jcrespo) [14:53:58] (03CR) 10Alexandros Kosiaris: [C: 04-1] maps - use new tileshell.js script to notify tilerator of expired tiles (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/307787 (https://phabricator.wikimedia.org/T139451) (owner: 10Gehel) [14:55:08] PROBLEM - HP RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
[14:57:38] RECOVERY - HP RAID on ms-be1016 is OK: OK: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2, Controller [14:57:42] (03CR) 10Gehel: maps - use new tileshell.js script to notify tilerator of expired tiles (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/307787 (https://phabricator.wikimedia.org/T139451) (owner: 10Gehel) [14:58:33] !log powered down several hosts for hardware maintenance (T142726): mw2099, mw2102, mw2117, mw2163-mw2199 [14:58:34] T142726: Multiple servers in codfw fail to respond to IPMI commands during reimaging - https://phabricator.wikimedia.org/T142726 [14:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:59:01] 06Operations: Multiple servers in codfw fail to respond to IPMI commands during reimaging - https://phabricator.wikimedia.org/T142726#2601552 (10MoritzMuehlenhoff) Hi Papaul, first batch: These are depooled from the cluster and powered down: mw2087 mw2099 mw2102 mw2117 mw2149-mw2151 mw2163-mw2199 [15:03:53] (03PS1) 10Rush: openstack: tune sqlalchemy pooling behavior [puppet] - 10https://gerrit.wikimedia.org/r/307956 (https://phabricator.wikimedia.org/T144339) [15:04:27] (03CR) 10Anomie: Do not use $wgExtensionFunctions to set globals (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307893 (https://phabricator.wikimedia.org/T143055) (owner: 10Gergő Tisza) [15:05:15] (03PS4) 10Andrew Bogott: Set up a root password for Labs instances [puppet] - 10https://gerrit.wikimedia.org/r/307086 (https://phabricator.wikimedia.org/T142531) [15:05:41] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM from swift/production perspective" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307494 (https://phabricator.wikimedia.org/T126338) (owner: 10Dereckson) [15:06:37] (03PS2) 10Rush: openstack: tune sqlalchemy pooling behavior [puppet] - 10https://gerrit.wikimedia.org/r/307956 
(https://phabricator.wikimedia.org/T144339) [15:07:39] (03CR) 10Andrew Bogott: [C: 031] "Yep, worth a try!" [puppet] - 10https://gerrit.wikimedia.org/r/307956 (https://phabricator.wikimedia.org/T144339) (owner: 10Rush) [15:08:59] (03PS3) 10BBlack: package::builder: EXPERIMENTAL=yes support [puppet] - 10https://gerrit.wikimedia.org/r/307943 [15:09:18] (03PS1) 10Dzahn: point wikipedia.in to 180.179.52.130 [dns] - 10https://gerrit.wikimedia.org/r/307959 [15:09:35] (03CR) 10Alexandros Kosiaris: maps - use new tileshell.js script to notify tilerator of expired tiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/307787 (https://phabricator.wikimedia.org/T139451) (owner: 10Gehel) [15:11:00] (03CR) 10BBlack: [C: 032] package::builder: EXPERIMENTAL=yes support [puppet] - 10https://gerrit.wikimedia.org/r/307943 (owner: 10BBlack) [15:11:27] (03Abandoned) 10Mobrovac: RESTBase: disable firejail [puppet] - 10https://gerrit.wikimedia.org/r/300276 (https://phabricator.wikimedia.org/T136957) (owner: 10Mobrovac) [15:11:28] 06Operations: Blocked /etc/passwd on sca100[1234] hosts - https://phabricator.wikimedia.org/T144492#2601459 (10akosiaris) I 've done a ``` service zotero stop puppet agent -t -v service zotero stop /usr/bin/gpasswd ops -M filippo,jgreen,bblack,andrew,faidon,rush,oblivian,laner,yuvipanda,dzahn,akosiaris,spring... [15:11:29] (03CR) 10Gehel: maps - use new tileshell.js script to notify tilerator of expired tiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/307787 (https://phabricator.wikimedia.org/T139451) (owner: 10Gehel) [15:11:59] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: onboarding Manuel Arostegui in ops - https://phabricator.wikimedia.org/T144469#2601629 (10jcrespo) 05Open>03Resolved a:03jcrespo So, aside from the subtask T144496, most tasks were completed and checked. There are some small hanging issues, lik... 
[15:13:00] (03CR) 10Andrew Bogott: [C: 032] Set up a root password for Labs instances [puppet] - 10https://gerrit.wikimedia.org/r/307086 (https://phabricator.wikimedia.org/T142531) (owner: 10Andrew Bogott) [15:13:05] (03PS5) 10Andrew Bogott: Set up a root password for Labs instances [puppet] - 10https://gerrit.wikimedia.org/r/307086 (https://phabricator.wikimedia.org/T142531) [15:17:26] (03PS4) 10Gehel: maps - use new tileshell.js script to notify tilerator of expired tiles [puppet] - 10https://gerrit.wikimedia.org/r/307787 (https://phabricator.wikimedia.org/T139451) [15:18:34] (03CR) 10jenkins-bot: [V: 04-1] maps - use new tileshell.js script to notify tilerator of expired tiles [puppet] - 10https://gerrit.wikimedia.org/r/307787 (https://phabricator.wikimedia.org/T139451) (owner: 10Gehel) [15:19:28] (03PS6) 10Andrew Bogott: Set up a root password for Labs instances [puppet] - 10https://gerrit.wikimedia.org/r/307086 (https://phabricator.wikimedia.org/T142531) [15:23:17] PROBLEM - Host wtp2016 is DOWN: PING CRITICAL - Packet loss = 100% [15:26:19] (03PS5) 10Gehel: maps - use new tileshell.js script to notify tilerator of expired tiles [puppet] - 10https://gerrit.wikimedia.org/r/307787 (https://phabricator.wikimedia.org/T139451) [15:30:43] andrewbogott, when you connect, lots of cronspam from labtestservices2001:/etc/dns-floating-ip-updater.py [15:30:58] I've already reported the one from stats1002 [15:31:08] jynus: ok, that's probably a Krenair issue [15:31:16] oh, sorry [15:31:23] I just assumed it was you [15:31:28] okay, what does it say jynus? 
[15:32:27] https://www.irccloud.com/pastebin/3RXsKQjG/ [15:32:31] Krenair: ^ [15:32:47] (but I haven't actually thought about it yet) [15:33:46] (03PS6) 10Gehel: maps - use new tileshell.js script to notify tilerator of expired tiles [puppet] - 10https://gerrit.wikimedia.org/r/307787 (https://phabricator.wikimedia.org/T139451) [15:34:56] !log reboot nova-compute on labvirt1013 as stuck (no logs, not applying any changes or taking any instruction) [15:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:35:10] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [15:36:17] !log gehel@palladium conftool action : set/pooled=yes; selector: name=elastic102[12].eqiad.wmnet [15:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:37:52] 06Operations, 06Labs, 06Release-Engineering-Team, 10wikitech.wikimedia.org, 07LDAP: Rename specific account in LDAP, Wikitech and Gerrit - https://phabricator.wikimedia.org/T133968#2601696 (10Sophivorus) Thanks! So @chasemp or @Andrew told you it's not ok to change a user's UID in LDAP? [15:38:42] andrewbogott, jynus: apparently it fails at this project: createprojecttest10 [15:38:51] andrewbogott, jynus: wonder why novaadmin can't list servers in that project [15:39:02] Krenair: I can just delete the project :) [15:39:02] Probably novaadmin isn't a member [15:39:02] although I don't know why that would've changed yesterday [15:39:46] As far as I'm concerned, a project that novaadmin can't administrate is a broken project [15:39:48] 06Operations, 10ops-codfw: Broken disk on wtp2016 - https://phabricator.wikimedia.org/T144260#2601719 (10Papaul) 05Open>03Resolved a:05Papaul>03akosiaris This server has basic support which doesn't convert SATA disk replacement. I used one of the spare disks on-site for replacement. 
[15:39:49] Nothing wrong with my script [15:40:07] Krenair: I agree [15:40:07] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [15:40:09] let me check [15:40:10] Krenair, nothing assumed your script was wrong [15:40:33] although maybe if it is a cron it shouldn't need to mail to root the errors? [15:40:43] No I know, I'm just saying - this particular issue is not simply my thing [15:41:20] my Engrish is getting worse and worse [15:41:27] even though issues with this script could certainly be my thing [15:42:58] PROBLEM - Disk space on ms-be1004 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=82%) [15:43:35] Krenair: novaadmin was a user but not a projectadmin. So I added the projectadmin role [15:43:51] interesting [15:44:11] I'm not clear on if that means that /all/ the spam will be fixed or just that one that I pasted though [15:44:14] andrewbogott, oh. [15:44:16] so you fixed one [15:44:22] now it fails on the next: createprojecttest11 [15:44:52] I actually don't know what the procedure is to create projects, or if there's some script [15:45:10] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 173 seconds ago with 0 failures [15:45:21] I'm going to look for projects where novaadmin only has one role vs. two and delete those projects [15:45:24] I'll take a look at ms-be1004 shortly [15:45:36] probably I made those from the commandline and they are (unsurprisingly) broken as a result [15:46:58] finally s3 schema change arrives to production hosts [15:48:47] next time someone proposes "INSERT SELECT" cannot be that slow, I will show them the times for a schema change on pagelinks [15:51:13] jynus, Krenair, I can't promise that I got all of them but I just deleted many broken projects. We'll see if we still get any of those cron notices. 
[15:51:20] (03CR) 10Rush: [C: 032] openstack: tune sqlalchemy pooling behavior [puppet] - 10https://gerrit.wikimedia.org/r/307956 (https://phabricator.wikimedia.org/T144339) (owner: 10Rush) [15:51:25] (03PS3) 10Rush: openstack: tune sqlalchemy pooling behavior [puppet] - 10https://gerrit.wikimedia.org/r/307956 (https://phabricator.wikimedia.org/T144339) [15:52:21] 06Operations, 10ops-eqiad: rack/setup/deploy puppetmaster100[12] - https://phabricator.wikimedia.org/T143219#2601772 (10Cmjohnson) [15:52:21] andrewbogott, testcreation4 is now the failing project [15:52:33] 06Operations, 10ops-eqiad: ms-be1004.eqiad.wmnet: slot=3 dev=sdd failed - https://phabricator.wikimedia.org/T144499#2601773 (10fgiunchedi) 03NEW [15:52:36] (03CR) 10Rush: [V: 032] openstack: tune sqlalchemy pooling behavior [puppet] - 10https://gerrit.wikimedia.org/r/307956 (https://phabricator.wikimedia.org/T144339) (owner: 10Rush) [15:52:51] Krenair: that one should be gone too... [15:53:18] RECOVERY - Disk space on ms-be1004 is OK: DISK OK [15:53:50] 06Operations, 10ops-eqiad: ms-be1004.eqiad.wmnet: slot=3 dev=sdd failed - https://phabricator.wikimedia.org/T144499#2601784 (10fgiunchedi) note that the disk is reported as ok by the raid controller, linux however encounters errors while using it [15:55:30] (03PS2) 10Mobrovac: Allow service-checker to read YAML-formatted specs [software/service-checker] - 10https://gerrit.wikimedia.org/r/306707 (https://phabricator.wikimedia.org/T136839) [15:55:39] PROBLEM - HP RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
[15:57:03] (03CR) 10Mobrovac: Allow service-checker to read YAML-formatted specs (031 comment) [software/service-checker] - 10https://gerrit.wikimedia.org/r/306707 (https://phabricator.wikimedia.org/T136839) (owner: 10Mobrovac) [15:58:07] RECOVERY - HP RAID on ms-be1016 is OK: OK: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2, Controller [15:59:01] (03CR) 10Mobrovac: [C: 031] Properly support 'basePath' [software/service-checker] - 10https://gerrit.wikimedia.org/r/307910 (owner: 10Legoktm) [16:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160901T1600). Please do the needful. [16:00:04] mobrovac: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:42] mobrovac: looking at the patches [16:00:44] * mobrovac is here [16:00:46] kk godog [16:02:49] (03PS1) 10Ema: Revert "Upgrade upload ulsfo to Varnish 4" [puppet] - 10https://gerrit.wikimedia.org/r/307964 (https://phabricator.wikimedia.org/T144257) [16:03:35] (03PS2) 10Filippo Giunchedi: Change-Prop: Ignore non-main NS titles for Wikidata updates [puppet] - 10https://gerrit.wikimedia.org/r/307791 (owner: 10Mobrovac) [16:03:43] godog: all patches can go at once, fyi [16:05:03] (03CR) 10Filippo Giunchedi: [C: 032] Change-Prop: Ignore non-main NS titles for Wikidata updates [puppet] - 10https://gerrit.wikimedia.org/r/307791 (owner: 10Mobrovac) [16:05:09] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [16:05:59] mobrovac: ack, I'll follow the same order as Deployments [16:06:16] (03PS2) 10Ema: Revert "Upgrade upload ulsfo to Varnish 4" [puppet] - 10https://gerrit.wikimedia.org/r/307964 (https://phabricator.wikimedia.org/T144257) [16:06:25] kk [16:06:26] (03CR) 10Ema: [C: 032 V: 032] Revert "Upgrade upload ulsfo to Varnish 4" [puppet] - 
10https://gerrit.wikimedia.org/r/307964 (https://phabricator.wikimedia.org/T144257) (owner: 10Ema) [16:06:52] (03PS2) 10Filippo Giunchedi: Change-Prop: Removed unused config properties [puppet] - 10https://gerrit.wikimedia.org/r/306300 (owner: 10Ppchelko) [16:08:24] (03CR) 10Filippo Giunchedi: [C: 032] Change-Prop: Removed unused config properties [puppet] - 10https://gerrit.wikimedia.org/r/306300 (owner: 10Ppchelko) [16:08:55] (03PS2) 10Filippo Giunchedi: Ignore 503 from ORES updates. [puppet] - 10https://gerrit.wikimedia.org/r/307810 (owner: 10Ppchelko) [16:09:59] !log downgrading cp4006 to varnish 3 T131502 [16:10:00] T131502: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502 [16:10:02] Any ideas on how to check if there was a problem with our ObjectCache infrastructure between 13:00 and 13:40 UTC yesterday? Or ideas about whom to ask? [16:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:10:09] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [16:10:16] (03CR) 10Addshore: "Sorry for a lack of a response (I have been on vacation)." [puppet] - 10https://gerrit.wikimedia.org/r/303863 (owner: 10Addshore) [16:11:14] (03PS1) 10Jdlrobson: Enable Wikidata descriptions taglines on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307967 (https://phabricator.wikimedia.org/T143345) [16:12:08] (03CR) 10Filippo Giunchedi: [C: 032] Ignore 503 from ORES updates. [puppet] - 10https://gerrit.wikimedia.org/r/307810 (owner: 10Ppchelko) [16:12:16] (03PS4) 10Andrew Bogott: Add new ssh key for Addshore [puppet] - 10https://gerrit.wikimedia.org/r/303863 (owner: 10Addshore) [16:12:33] mobrovac: all merged, can you check? 
[16:12:58] k godog, running puppet and restarting [16:14:09] (03PS1) 10Jdlrobson: Do not show Wikidata descriptions on meta or mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307968 [16:14:11] (03PS1) 10Jdlrobson: Enable on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307969 (https://phabricator.wikimedia.org/T143345) [16:14:44] (03CR) 10Andrew Bogott: [C: 032] Add new ssh key for Addshore [puppet] - 10https://gerrit.wikimedia.org/r/303863 (owner: 10Addshore) [16:14:48] (03CR) 10jenkins-bot: [V: 04-1] Do not show Wikidata descriptions on meta or mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307968 (owner: 10Jdlrobson) [16:14:51] thanks andrewbogott ! [16:15:02] (03CR) 10jenkins-bot: [V: 04-1] Enable on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307969 (https://phabricator.wikimedia.org/T143345) (owner: 10Jdlrobson) [16:15:12] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 178 seconds ago with 0 failures [16:15:26] addshore: it'll take ~30 minutes before the key is active everywhere [16:16:05] Okay! [16:17:06] godog: looking good in codfw, proceeding with eqiad [16:19:29] mobrovac: https://media.giphy.com/media/xKy2w6LehxxHa/giphy.gif [16:19:45] hahahahaha [16:21:16] (03PS2) 10Jdlrobson: End lazy loading reference experiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307435 (https://phabricator.wikimedia.org/T144240) [16:21:32] godog: looking good everywhere [16:21:34] godog: thnx! [16:21:49] mobrovac: np! 
[16:24:06] !log restarting pybal on lvs4002 T134893 [16:24:07] T134893: Unhandled pybal error causing services to be depooled in etcd but not in lvs - https://phabricator.wikimedia.org/T134893 [16:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:25:13] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures [16:26:35] 06Operations, 10Beta-Cluster-Infrastructure, 05Prometheus-metrics-monitoring: deploy prometheus node_exporter and server to deployment-prep - https://phabricator.wikimedia.org/T144502#2601885 (10fgiunchedi) [16:30:12] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures [16:30:23] (03PS1) 10BBlack: nginx-1.11.3 + openssl-1.1.0 temp hacky build [software/nginx] (wmf-1.11.3) - 10https://gerrit.wikimedia.org/r/307973 [16:32:43] andrewbogott, jynus: I think the issue is resolved now [16:34:05] bblack: sorry for the bother, whom or where could I consult to find out if there were issues with our object cache infrastructure (if that ever happens?) yesterday 13:00 - 13:45 UTC? thx in advance! [16:34:57] AndyRussG: what is our object cache infrastructure? 
[16:35:01] (in this question) [16:35:13] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures [16:35:14] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [16:35:47] bblack: whatever is behind MediaWiki's ObjectCache::getMainWANInstance() [16:35:56] 06Operations, 06WMF-NDA-Requests: Phabricator access for Manuel Arostegui to WMF-NDA and Ops-only procurement and other tasks - https://phabricator.wikimedia.org/T144496#2601909 (10Aklapper) [16:37:01] I'm not sure what that is tbh [16:37:07] 06Operations, 06WMF-NDA-Requests: Phabricator access for Manuel Arostegui to WMF-NDA and Ops-only procurement and other tasks - https://phabricator.wikimedia.org/T144496#2601603 (10Aklapper) Regarding WMF-NDA membership (such requests go to #WMF-NDA-Requests): I've added @Marostegui to #WMF-NDA. Steps I perfo... [16:37:25] bblack: maybe memcached? [16:37:39] (03PS7) 10Andrew Bogott: Set up a root password for Labs instances [puppet] - 10https://gerrit.wikimedia.org/r/307086 (https://phabricator.wikimedia.org/T142531) [16:37:44] 06Operations, 06WMF-NDA-Requests: Phabricator access for Manuel Arostegui to WMF-NDA and Ops-only procurement and other tasks - https://phabricator.wikimedia.org/T144496#2601943 (10Aklapper) Regarding access to tasks in private S4: S4 was created in T93760 and that task does not document who can edit the membe... [16:38:00] AndyRussG: probably [16:38:10] I'm looking at an unexplained bug where this code is suspected of behaving oddly: https://github.com/wikimedia/mediawiki-extensions-CentralNotice/blob/f6abdb6be74bb8620b19918dbff5c8d617f053d9/includes/ChoiceDataProvider.php#L36-L52 [16:38:49] It could be a bug or maybe also an outage, but I don't know enuf about how that works behind the scenes to know if that makes sense... 
[16:38:56] (https://phabricator.wikimedia.org/T144393) [16:39:42] This is as far as I got wrt doc: https://wikitech.wikimedia.org/wiki/Memcached [16:40:14] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [16:40:14] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [16:40:24] Maybe I should look for something on ganglia [16:42:31] AndyRussG: there doesn't seem to be any big obvious event in basic system stats for eqiad memcached then [16:43:46] bblack: K thanks... Yeah I should have remembered Ganglia!! That's the right place to look 4 next time? https://ganglia.wikimedia.org/latest/?r=week&cs=08%2F30%2F2016+00%3A00&ce=09%2F01%2F2016+00%3A00&c=Memcached+eqiad&h=&tab=m&vn=&hide-hf=false&m=cpu_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [16:44:10] AndyRussG: yeah [16:44:45] Oh wait there was something on codfw [16:44:56] !log gehel@palladium conftool action : set/pooled=yes; selector: name=elastic1028.eqiad.wmnet [16:45:00] * Jeff_Green fixing boron...^^^ [16:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:45:15] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 169 seconds ago with 0 failures [16:45:22] (03PS3) 10Andrew Bogott: Don't attempt to set root user password [labs/private] - 10https://gerrit.wikimedia.org/r/304321 (owner: 10Yuvipanda) [16:46:19] though not at the same time [16:47:04] https://ganglia.wikimedia.org/latest/?r=custom&cs=08%2F31%2F2016+00%3A00&ce=08%2F31%2F2016+23%3A59&c=Memcached+codfw&h=&tab=m&vn=&hide-hf=false&m=cpu_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [16:47:11] bblack: ^ [16:48:16] (03PS13) 10Gehel: elasticsearch - move relforge to its own role [puppet] - 10https://gerrit.wikimedia.org/r/304067 [16:48:43] (03CR) 10Andrew Bogott: [C: 032 V: 032] Don't attempt to set root user password [labs/private] - 
10https://gerrit.wikimedia.org/r/304321 (owner: 10Yuvipanda) [16:49:42] !log cp4006 repooled after downgrade [16:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:50:15] AndyRussG: yeah that's showing some event for codfw memcache ~05:30 on the 31st... but codfw memcache/mw isn't used for real traffic at present, right? [16:50:42] I have no idea 8p [16:50:57] hmm my time was off looking at a zoomed-out view [16:51:08] it's more like 03:42 -> 06:30 [16:51:19] still doesn't match the time of the bug I have (13:00 - 13:40) [16:51:22] jouncebot: refresh [16:51:25] I refreshed my knowledge about deployments. [16:51:41] AndyRussG: is that time definitely UTC and not mis-tz-translated somewhere? [16:51:45] Maybe there was some maintenance that could have corrupted stuff, even though no issues show up in these graphs? [16:51:55] it's possible [16:52:04] (03PS8) 10Andrew Bogott: Set up a root password for Labs instances [puppet] - 10https://gerrit.wikimedia.org/r/307086 (https://phabricator.wikimedia.org/T142531) [16:52:04] definitely UTC [16:52:17] what kind of issue happened? I thought the memcache stuff was mostly a true cache, and thus missing things aren't horribly broken [16:52:52] (03CR) 10Gehel: [C: 032] elasticsearch - move relforge to its own role [puppet] - 10https://gerrit.wikimedia.org/r/304067 (owner: 10Gehel) [16:53:06] Changes to the settings in the CentralNotice tables weren't reflected in banners/campaigns being served [16:53:22] the mysql tables? [16:53:27] yeah [16:53:34] The CN DB data seems to have been all good (I checked the CN log tables) [16:53:41] Also checked about replication issues [16:53:48] I'm guessing those happen through some API that takes care of memcache, not as a direct mysql data mod? 
[16:54:28] The admin changes go directly to MySQL, but yeah, the info sent to clients about what campaigns go where goes through memcache [16:54:35] ObjectCache [16:55:04] Also maybe worthwhile to note that the people making the changes were in Europe? (or at least one of them was) [16:55:27] Supposedly the cache is purged every time there is an CN campaign settings change [16:55:31] so if the admin changes go direct to mysql, what invalidates the now-stale object cache entries? I don't think a direct mysql update would. [16:55:38] hmmm ok [16:55:44] No there's specific code that does that [16:55:57] (which could also have a bug... that'd be the other theory) [16:56:43] https://github.com/wikimedia/mediawiki-extensions-CentralNotice/blob/f6abdb6be74bb8620b19918dbff5c8d617f053d9/includes/ChoiceDataProvider.php#L18-L22 [16:57:10] Things started to work after admin users jiggled some unrelated settings [16:57:56] Which makes a caching issue on some level seem a feasible explanation [16:58:26] PROBLEM - Host mw2199 is DOWN: PING CRITICAL - Packet loss = 100% [16:58:26] PROBLEM - Host mw2193 is DOWN: PING CRITICAL - Packet loss = 100% [16:58:37] PROBLEM - Host mw2191 is DOWN: PING CRITICAL - Packet loss = 100% [16:58:37] PROBLEM - Host mw2192 is DOWN: PING CRITICAL - Packet loss = 100% [16:58:37] PROBLEM - Host mw2198 is DOWN: PING CRITICAL - Packet loss = 100% [16:58:37] PROBLEM - Host mw2195 is DOWN: PING CRITICAL - Packet loss = 100% [16:58:37] PROBLEM - Host mw2197 is DOWN: PING CRITICAL - Packet loss = 100% [16:58:37] PROBLEM - Host mw2194 is DOWN: PING CRITICAL - Packet loss = 100% [16:58:47] PROBLEM - Host mw2190 is DOWN: PING CRITICAL - Packet loss = 100% [16:58:47] PROBLEM - Host mw2196 is DOWN: PING CRITICAL - Packet loss = 100% [16:58:57] Yeah it's nothing horrible if the cache data is missing, but it can be bad if wrong cache data is being served [16:59:12] ^fixed, downtime too short [16:59:17] PROBLEM - puppet last run on elastic2006 is CRITICAL: CRITICAL: 
puppet fail [16:59:58] ^ elastic2006 is me, checking [17:00:05] yurik, gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160901T1700). Please do the needful. [17:00:05] bd808: A patch you scheduled for Services – Graphoid / Parsoid / OCG / Citoid / ORES is about to be deployed. Please be available during the process. [17:00:15] nope [17:00:20] no parsoid deploy [17:00:26] no ORES [17:00:51] Yeah, it's time for my patch! :) [17:01:50] (03PS9) 10Andrew Bogott: Set up a root password for Labs instances [puppet] - 10https://gerrit.wikimedia.org/r/307086 (https://phabricator.wikimedia.org/T142531) [17:02:00] (03CR) 10Dzahn: [C: 031] "yea, it's just a comment line now, of course. but dont wanna restart gerrit service for this, which will happen after merge. so whenever t" [puppet] - 10https://gerrit.wikimedia.org/r/256663 (https://phabricator.wikimedia.org/T75997) (owner: 10Thiemo Mättig (WMDE)) [17:02:01] RECOVERY - puppet last run on elastic2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:03:09] (03CR) 10Andrew Bogott: [C: 032] Set up a root password for Labs instances [puppet] - 10https://gerrit.wikimedia.org/r/307086 (https://phabricator.wikimedia.org/T142531) (owner: 10Andrew Bogott) [17:03:10] I've got a striker deploy to do [17:03:36] (03PS1) 10BBlack: text VCL: bad browser redirect: target IE8/XP more [puppet] - 10https://gerrit.wikimedia.org/r/307977 (https://phabricator.wikimedia.org/T118181) [17:04:49] Oops, my patch is in an hour. 
[17:05:57] (03CR) 10BBlack: [C: 032] text VCL: bad browser redirect: target IE8/XP more [puppet] - 10https://gerrit.wikimedia.org/r/307977 (https://phabricator.wikimedia.org/T118181) (owner: 10BBlack) [17:06:02] (03PS2) 10BBlack: text VCL: bad browser redirect: target IE8/XP more [puppet] - 10https://gerrit.wikimedia.org/r/307977 (https://phabricator.wikimedia.org/T118181) [17:06:07] (03CR) 10BBlack: [V: 032] text VCL: bad browser redirect: target IE8/XP more [puppet] - 10https://gerrit.wikimedia.org/r/307977 (https://phabricator.wikimedia.org/T118181) (owner: 10BBlack) [17:07:39] 06Operations: Instead of url forward for wikipedia.in ... add server ip. - https://phabricator.wikimedia.org/T144508#2602033 (10Naveenpf) [17:08:14] (03PS1) 10Andrew Bogott: Define labs_puppet_master for labs instances. [puppet] - 10https://gerrit.wikimedia.org/r/307980 (https://phabricator.wikimedia.org/T142531) [17:08:17] ACKNOWLEDGEMENT - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors Gehel hostgroup missing after the rename of the relforge cluster - gehel [17:09:36] (03CR) 10Andrew Bogott: [C: 032] Define labs_puppet_master for labs instances. 
[puppet] - 10https://gerrit.wikimedia.org/r/307980 (https://phabricator.wikimedia.org/T142531) (owner: 10Andrew Bogott) [17:10:03] (03PS1) 10MaxSem: Deploy Kartographer everywhere public [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307981 (https://phabricator.wikimedia.org/T144062) [17:10:20] (03PS1) 10Cmjohnson: Adding new mgmt dns entries for 3 new fundraising hosts, pay-lvs's frauth1001 [dns] - 10https://gerrit.wikimedia.org/r/307982 [17:10:54] (03CR) 10Cmjohnson: [C: 032] Adding new mgmt dns entries for 3 new fundraising hosts, pay-lvs's frauth1001 [dns] - 10https://gerrit.wikimedia.org/r/307982 (owner: 10Cmjohnson) [17:12:09] !log Updated striker to ac555bd; fixes T144064 [17:12:10] T144064: Tool Maintainers badly overcounted - https://phabricator.wikimedia.org/T144064 [17:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:14:50] bblack: sorry 4 the bother still, who might be directly responsible for memcached infrastructure? Or whom should I ask to find out? thx again!!! [17:15:31] (03PS1) 10BBlack: openssl (1.1.0-1+wmf2) experimental; urgency=medium [debs/openssl] (wmf-1.1) - 10https://gerrit.wikimedia.org/r/307983 [17:16:38] AndyRussG: well we all are in some sense, for the infra, but I don't understand our mc to that level, and I'm not sure who does. I would look first to tracking down where things went wrong with purging the stale data before looking for an infra outage. [17:17:04] AndyRussG: maybe _joe_ too, but he's on vacation [17:17:17] AndyRussG: can the data simply be purged again? [17:18:02] bblack: K... Yeah it can be purged as much as we like [17:18:28] In the code itself I don't see how it was possible that the cache wasn't purged [17:18:56] But someone who knows the infrastructure well might be able to shed some light [17:19:07] Maybe it was only purged in one datacenter? [17:20:46] (03PS1) 10Gehel: Create the relforge hostgroups now that relforge cluster has been renamed. 
[puppet] - 10https://gerrit.wikimedia.org/r/307985 [17:22:18] PROBLEM - Host ms-be1022 is DOWN: PING CRITICAL - Packet loss = 100% [17:22:22] np waiting until Monday to pester _joe_ I think (unless this starts to happen again :) ) [17:24:00] (03PS2) 10Gehel: Create the relforge hostgroups now that relforge cluster has been renamed. [puppet] - 10https://gerrit.wikimedia.org/r/307985 [17:24:10] RECOVERY - Host ms-be1022 is UP: PING OK - Packet loss = 0%, RTA = 1.32 ms [17:25:29] (03CR) 10Gehel: [C: 032] Create the relforge hostgroups now that relforge cluster has been renamed. [puppet] - 10https://gerrit.wikimedia.org/r/307985 (owner: 10Gehel) [17:27:44] PROBLEM - HP RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [17:29:27] RECOVERY - MariaDB Slave Lag: s1 on db1047 is OK: OK slave_sql_lag Replication lag: 27.01 seconds [17:30:40] (03PS3) 10Dzahn: Add wikiquote.pl, link to parking [dns] - 10https://gerrit.wikimedia.org/r/254294 (owner: 10Odder) [17:31:31] (03PS1) 10Yuvipanda: Temp. Hack to get tools up and running [labs/private] - 10https://gerrit.wikimedia.org/r/307989 [17:32:11] 06Operations, 10ops-eqiad: Rack/setup sodium (carbon/mirror server replacement) - https://phabricator.wikimedia.org/T139171#2602170 (10Cmjohnson) Received the new disks from Dell, installed them set to RAID and the new disks are now handled by the BIOS. [17:32:28] RECOVERY - HP RAID on ms-be1016 is OK: OK: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2, Controller [17:33:01] (03CR) 10Rush: [C: 031] Temp. Hack to get tools up and running [labs/private] - 10https://gerrit.wikimedia.org/r/307989 (owner: 10Yuvipanda) [17:33:42] (03CR) 10Yuvipanda: [C: 032 V: 032] Temp. 
Hack to get tools up and running [labs/private] - 10https://gerrit.wikimedia.org/r/307989 (owner: 10Yuvipanda) [17:33:44] (03CR) 10Dzahn: "We are going to add the domain to our zones so that it exists properly but linked to the "parking" template so no traffic until the redire" [dns] - 10https://gerrit.wikimedia.org/r/254294 (owner: 10Odder) [17:34:12] (03PS4) 10Dzahn: Add wikiquote.pl, link to parking [dns] - 10https://gerrit.wikimedia.org/r/254294 (owner: 10Odder) [17:34:27] (03PS1) 10Andrew Bogott: vmbuilder: Upgrade openssh on firstboot [puppet] - 10https://gerrit.wikimedia.org/r/307991 [17:38:00] (03CR) 10Dzahn: [C: 032] Add wikiquote.pl, link to parking [dns] - 10https://gerrit.wikimedia.org/r/254294 (owner: 10Odder) [17:39:35] (03PS1) 10Jgreen: flip fundraising-read.wmnet db alias from db1008 to frdb1001 [dns] - 10https://gerrit.wikimedia.org/r/307992 [17:40:27] (03CR) 10Andrew Bogott: [C: 032] vmbuilder: Upgrade openssh on firstboot [puppet] - 10https://gerrit.wikimedia.org/r/307991 (owner: 10Andrew Bogott) [17:44:17] RECOVERY - Router interfaces on pfw-eqiad is OK: OK: host 208.80.154.218, interfaces up: 111, down: 0, dormant: 0, excluded: 2, unused: 0 [17:44:17] (03CR) 10Jgreen: [C: 032] flip fundraising-read.wmnet db alias from db1008 to frdb1001 [dns] - 10https://gerrit.wikimedia.org/r/307992 (owner: 10Jgreen) [17:46:00] (03PS1) 10Andrew Bogott: Rename the labs instance setting labs_puppet_master. [puppet] - 10https://gerrit.wikimedia.org/r/307994 [17:47:25] (03CR) 10Thcipriani: [C: 031] "I would be less tempted to override this, I think." [puppet] - 10https://gerrit.wikimedia.org/r/307994 (owner: 10Andrew Bogott) [17:47:49] (03CR) 10Andrew Bogott: [C: 032] Rename the labs instance setting labs_puppet_master. 
[puppet] - 10https://gerrit.wikimedia.org/r/307994 (owner: 10Andrew Bogott) [17:47:57] (03Abandoned) 10Jgreen: flip fundraising-read.wmnet db alias from db1008 to frdb1001 [dns] - 10https://gerrit.wikimedia.org/r/307992 (owner: 10Jgreen) [17:49:47] (03PS3) 10Eevans: Simplification of Cassandra Logstash filtering [puppet] - 10https://gerrit.wikimedia.org/r/282466 (https://phabricator.wikimedia.org/T130861) (owner: 10Jstenval) [17:51:30] PROBLEM - Tool Labs instance distribution on labcontrol1001 is CRITICAL: CRITICAL: static class instances not spread out enough [17:51:30] PROBLEM - Tool Labs instance distribution on labcontrol1002 is CRITICAL: CRITICAL: static class instances not spread out enough [17:51:50] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: onboarding Manuel Arostegui in ops - https://phabricator.wikimedia.org/T144469#2602203 (10RobH) [17:51:52] 06Operations, 06WMF-NDA-Requests: Phabricator access for Manuel Arostegui to WMF-NDA and Ops-only procurement and other tasks - https://phabricator.wikimedia.org/T144496#2602200 (10RobH) 05Open>03Resolved a:03RobH RobH added a member for acl*operations-team: Marostegui. Thu, Sep 1, 17:50 I've updated t... [17:52:23] (03PS1) 10Jgreen: flip fundraising-read.wmnet alias from db1008 to frdb1001 [dns] - 10https://gerrit.wikimedia.org/r/307995 [17:54:19] PROBLEM - HP RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
[17:54:19] RECOVERY - Tool Labs instance distribution on labcontrol1001 is OK: OK: All critical toollabs instances are spread out enough [17:54:19] RECOVERY - Tool Labs instance distribution on labcontrol1002 is OK: OK: All critical toollabs instances are spread out enough [17:54:49] (03CR) 10Jgreen: [C: 032] flip fundraising-read.wmnet alias from db1008 to frdb1001 [dns] - 10https://gerrit.wikimedia.org/r/307995 (owner: 10Jgreen) [17:55:35] !log switching fundraising database reader from db1008 to frdb1001 [17:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:57:49] (03PS4) 10Eevans: Simplification of Cassandra Logstash filtering [puppet] - 10https://gerrit.wikimedia.org/r/282466 (https://phabricator.wikimedia.org/T130861) (owner: 10Jstenval) [17:58:38] PROBLEM - puppet last run on elastic2020 is CRITICAL: CRITICAL: Puppet has 1 failures [17:59:38] RECOVERY - HP RAID on ms-be1016 is OK: OK: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2, Controller [17:59:44] ^ elastic2020 checked, transient error, all is fine [18:00:05] anomie, ostriches, thcipriani, hashar, twentyafterfour, and aude: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160901T1800). Please do the needful. [18:00:05] Niharika, dcausse, and jgirault: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:31] Hullo. [18:00:34] o/ [18:01:37] RECOVERY - puppet last run on elastic2020 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:01:55] I can SWAT today [18:02:01] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed(?) 
sda on ms-be1022 - https://phabricator.wikimedia.org/T140597#2602231 (10Cmjohnson) @fgiunchedi the new disk showed and I replaced the one that was producing errors...which was /dev/sdb afaik. Please check and lmk how it goes. [18:03:11] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307927 (https://phabricator.wikimedia.org/T142056) (owner: 10Niharika29) [18:03:38] (03Merged) 10jenkins-bot: Test PageAssessments on English Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307927 (https://phabricator.wikimedia.org/T142056) (owner: 10Niharika29) [18:04:16] Niharika: patch is live on mw1099 if you have anything to test there [18:04:36] * Niharika checks [18:04:45] (03CR) 10Eevans: "[Puppet-compiler output here](http://puppet-compiler.wmflabs.org/3911/)" [puppet] - 10https://gerrit.wikimedia.org/r/282466 (https://phabricator.wikimedia.org/T130861) (owner: 10Jstenval) [18:04:59] (03PS5) 10Eevans: Simplification of Cassandra Logstash filtering [puppet] - 10https://gerrit.wikimedia.org/r/282466 (https://phabricator.wikimedia.org/T130861) (owner: 10Jstenval) [18:05:34] thcipriani: Looks fine to me. [18:05:43] Niharika: ack, syncing everywhere [18:08:09] is mw2187.codfw.wmnet having problems? [18:08:49] Is that a question for me? [18:09:19] Niharika: no, for an opsen/root to check out [18:09:25] not letting me ssh to that machine from tin [18:09:29] Ah, okay. [18:11:28] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:307927|Test PageAssessments on English Wikivoyage (T142056)]] (duration: 04m 54s) [18:11:29] T142056: Test PageAssessements on English Wikivoyage - https://phabricator.wikimedia.org/T142056 [18:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:12:55] ^ Niharika I'm going to need to resync that once we get mw2187 working and/or removed from our server list, sorry :\ [18:13:10] thcipriani: No worries. Thank you. 
[18:14:23] thcipriani: uhh, is it half-deployed right now? I don't think it's a good idea to keep an extension half deployed...could cause weird stuff [18:15:33] legoktm: you're right. I'll revert locally until this is resolved [18:19:10] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: REVERT because proxy down SWAT: [[gerrit:307927|Test PageAssessments on English Wikivoyage (T142056)]] (duration: 03m 15s) [18:19:10] !log gehel@palladium conftool action : set/pooled=yes; selector: name=elastic1029.eqiad.wmnet [18:19:11] T142056: Test PageAssessements on English Wikivoyage - https://phabricator.wikimedia.org/T142056 [18:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:19:32] ^ should be reverted locally so we're not in a half deployed state [18:19:55] (03PS6) 10Eevans: Simplification of Cassandra Logstash filtering [puppet] - 10https://gerrit.wikimedia.org/r/282466 (https://phabricator.wikimedia.org/T130861) (owner: 10Jstenval) [18:21:11] thcipriani: yeah, but now deployed code != repo code [18:21:57] Dereckson: temporarily, until this is figured out [18:23:07] (03PS1) 10Andrew Bogott: Revert "Pare down the cloud-init commands on precise" [puppet] - 10https://gerrit.wikimedia.org/r/307999 [18:24:30] (03CR) 10Andrew Bogott: [C: 032] Revert "Pare down the cloud-init commands on precise" [puppet] - 10https://gerrit.wikimedia.org/r/307999 (owner: 10Andrew Bogott) [18:26:40] !log mw2187 - powercycled [18:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:29:22] mw2187 is back up [18:29:28] and started its services [18:30:23] The last Puppet run was at Thu Sep 1 14:30:42 UTC 2016 (238 minutes ago). [18:32:05] When servers are rebooted, do they automatically retrieve the newest code through a scap pull, or should we do it manually to avoid inconsistency? 
[18:32:26] they need a scap pull [18:32:34] Dereckson: the latter please [18:32:48] i am running puppet [18:32:54] puppet only does the pull if the directory is completely missing [18:33:32] yea, could you please sync it [18:33:35] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:307927|Test PageAssessments on English Wikivoyage (T142056)]] (duration: 02m 48s) [18:33:36] T142056: Test PageAssessements on English Wikivoyage - https://phabricator.wikimedia.org/T142056 [18:33:40] looks like it was down for 3h [18:33:50] of course it's ok if nothing changed since then [18:34:09] mutante: uhhh. There are like 32 other servers that may be down. I thought they were just having trouble connecting to that proxy/there was a bug with the scap code [18:34:42] the 32 down hosts in icinga are all ACKed [18:34:47] just 1 one was not [18:34:58] are they out of dsh? [18:35:03] no [18:35:10] this again [18:35:31] they are all codfw and in scheduled downtimes [18:35:37] says icinga [18:35:46] so they should be removed from dsh, yes? [18:35:53] haven't we had this conversation before? :) [18:35:55] from Sep 1 to Sep 2 [18:35:59] thcipriani: Niharika's checking out (since it's past midnight her time), but I can take over for her for the PageAssessments testing. [18:36:24] 06Operations, 10Traffic, 13Patch-For-Review: Support TLS chacha20-poly1305 AEAD ciphers - https://phabricator.wikimedia.org/T131908#2602326 (10BBlack) This ticket has dragged on and wavered off-topic considerably. We've been supporting chapoly ciphers for about a month now, so the base issue here is resolve... [18:36:30] 06Operations, 10Traffic, 13Patch-For-Review: Support TLS chacha20-poly1305 AEAD ciphers - https://phabricator.wikimedia.org/T131908#2602327 (10BBlack) 05Open>03Resolved a:03BBlack [18:36:36] yea, i dunno about that downtime though [18:36:44] doesnt say which user added it [18:36:55] kaldari: ack. Thank you. 
My understanding right now is that all the servers that don't have the new code are down for maintenance so it should be deployed everywhere. [18:37:03] depooled from pybal should equal no getting in scap's way [18:37:38] so, scap needs to talk to etcd/conftool, which is on our plan :) [18:37:42] (I guess is the right way?) [18:37:43] fwiw _joe._ is working on a thing that does that/maybe already merged that thing and it's not working as expected [18:37:52] * greg-g nods [18:39:27] thcipriani: Cool, it does appear to be live on En WikiVoyage (which is our target). I'll test there. [18:40:02] thcipriani: do these 32 others block your deployment? or was it just that a scap-proxy was down that blocked you [18:42:06] mutante: not "blocking" per-se, but if they're in that dsh file then the deployment has to wait for ssh to timeout on those hosts, so a deploy that takes 45 seconds takes 3 minutes [18:43:02] which is a bummer and freaks me out and is pretty avoidable [18:43:05] greg-g: eh.. there are no dsh files anymore actually [18:43:36] thcipriani: Seems to be working well! [18:43:38] they are generated and things changed ... [18:43:47] looking [18:44:00] kaldari: awesome, thanks for checking :) [18:44:03] * greg-g hangs head [18:44:14] ah, so that thing did merge likely [18:44:37] there was that "generate dsh files" ticket [18:45:13] https://phabricator.wikimedia.org/T80395 but i meant.. 
eh [18:45:23] this https://phabricator.wikimedia.org/rOPUPaf3047ecdad79125a334c798fba0606ab7fba330 [18:45:40] goes to hiera [18:46:00] eh, I'm trying to find the more recent thing from _jo.e_ from like a week ago [18:46:32] i found the dsh.yaml file, but i do _not_ see those down hosts in there [18:47:10] the hosts that are down/acked in icinga are not in it [18:48:08] https://phabricator.wikimedia.org/rOPUP331eb24b94ab2453451d8c1fc511ec6c7f51b2fe [18:49:00] conftool-data/nodes/codfw.yaml: mw2167.codfw.wmnet: [apache2] [18:49:03] that [18:49:12] ah [18:49:18] let me remove them there [18:49:52] if one ever looks for the conftool-data, I have added the link to https://noc.wikimedia.org/ :D [18:50:33] cool, thanks [18:50:42] yeah I added the noc main page to my handy bookmarks, it's lookin pretty good [18:51:02] we also had a bunch of servers reimaged over the day [18:52:26] hrm, looks like they're all enabled: False there https://config-master.wikimedia.org/conftool/codfw/apaches [18:53:16] maybe tin just needs a puppet run to update the dsh files? Or maybe scap_proxies is not being generated anymore and it just unmanaged now. [18:53:35] (03PS1) 10Dzahn: remove mw2167 thru mw2199 from conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/308004 [18:56:07] (03CR) 10Gergő Tisza: Do not use $wgExtensionFunctions to set globals (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307893 (https://phabricator.wikimedia.org/T143055) (owner: 10Gergő Tisza) [18:57:10] (03CR) 10Dzahn: [C: 032] "they are also all disabled here https://config-master.wikimedia.org/conftool/codfw/apaches to be reverted after maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/308004 (owner: 10Dzahn) [18:58:04] 06Operations, 06Labs: Enable root passwords on Labs VMs - https://phabricator.wikimedia.org/T142216#2602444 (10Andrew) 05Open>03Resolved This is done now. 
Passwords are per-project, and located in var/local/labs-root-passwords/ on the labs puppetmaster (currently labcontrol1001). [18:58:46] wonder if there's something wrong here? https://github.com/wikimedia/operations-puppet/blob/production/modules/scap/templates/dsh/dsh-group.erb#L15 [18:59:14] or possibly where they're defined in https://github.com/wikimedia/operations-puppet/blob/production/hieradata/common/scap/dsh.yaml#L19 [18:59:18] thcipriani: ehmm.. that isnt merged yet, and now i'm not sure [18:59:22] reading wikitech [18:59:29] that step is only for actual "decom" [18:59:49] but the scheduled downtime is just for 24h [19:00:04] hashar: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160901T1900). [19:00:14] (03CR) 10Eevans: [C: 031] "I tested this in deployment-prep, and it seems to work as advertised." [puppet] - 10https://gerrit.wikimedia.org/r/282466 (https://phabricator.wikimedia.org/T130861) (owner: 10Jstenval) [19:00:20] what isn't merged yet? [19:00:35] to remove the server names from conftool data files in puppet repo [19:00:51] (as opposed to just running commands to depool them) [19:02:16] o/ [19:02:39] jouncebot: delay [19:02:44] mutante: ah yeah I think there's some issue with the dynamic writing of the dsh files, somewhere in one of the files I linked, I'm not sure about removing them either, but this is causing problems :\ [19:03:44] they are not in dsh.yaml [19:03:47] is puppet querying conftool to generate the dsh files ? 
[19:03:54] that's where i was before [19:04:44] let's try removing them from conftool-data i guess (even though the docs say that's only for permanent decom) [19:04:57] hashar: yarp https://github.com/wikimedia/operations-puppet/blob/production/modules/scap/templates/dsh/dsh-group.erb#L13 [19:05:00] unless you can survive like this for 24h [19:05:09] then the scheduled downtime would be over [19:05:21] I have to do the train deploy to group2 now [19:06:04] ok, i submitted https://gerrit.wikimedia.org/r/#/c/308004/ [19:06:21] dcausse: jgirault_ Going to have to push these patches to evening SWAT, sorry for the technical difficulties :( [19:06:34] or kill puppet on tin and manually update the dsh files [19:06:43] thcipriani: no problem, thanks! [19:07:15] i ran puppet-merge [19:07:29] there was an error running conftool which is started by that :/ [19:08:48] runs puppet on tin [19:08:56] mutante: based on giuseppe commit 331eb24b94ab2453451d8c1fc511ec6c7f51b2fe it should skip servers marked 'inactive' [19:09:08] codfw: Removing node mw2187.codfw.wmnet from cluster appserver/apache2 [19:09:11] ERROR:conftool:delete_node Backend error while deleting node: Backend error: The request requires user authentication : Insufficient credentials [19:09:14] that's an env thing [19:09:28] puppet-merge was supposed to trigger conftool actions [19:09:32] thcipriani ok, thanks! [19:09:37] for "delete_node" [19:11:16] hashar: thcipriani: i edited the dsh file in tin directly [19:11:29] neat [19:11:35] { 'host': 'mw2183.codfw.wmnet', 'weight':20, 'enabled': False } [19:11:43] !log tin removing mw2167 thru mw2199 from dsh file manually, re-running puppet [19:11:44] hosts.select{ |x| x['value']['pooled'] != 'inactive' } [19:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:11:57] so yeah seems we flag them 'enabled': false [19:12:09] but the dsh file is based on 'pooled' == 'inactive' ? 
:( [19:12:14] hashar: yeah but this is the output from confctl which may be different? [19:12:22] yeah [19:12:50] also: none of those dsh files look like they're coming from this function, they don't have comments saying they're managed by puppet or anything [19:13:06] running puppet re-adds the hosts, we have to disable puppet on tin [19:13:20] until the deploy is over [19:13:21] oh no, they do, just the proxies and stuff :) [19:14:11] !log tin temp. disabled puppet [19:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:14:34] thcipriani: ok, so, i have to go but i want to help.. i removed them again from /etc/dsh/group/mediawiki-installation directly, disabled puppet [19:14:37] hashar: ^ [19:14:44] once i'm back i would re-enable it [19:14:57] just have to pick somebody up right now [19:15:05] thcipriani: anything left to do for SWAT ? [19:15:06] mutante: ok, sounds good, thank you for your help, this should tide us over [19:15:12] mutante: thank you !!!! 
[19:15:14] hashar: no, I bumped everything [19:15:18] be back soon [19:15:24] ok rolling the train so [19:15:35] which a five years old can do https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Thursday:_group.7B0.2C1.7D_to_all_deploy [19:19:31] !log hashar@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.28.0-wmf.17 [19:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:20:14] (03PS1) 10Ottomata: Reportupdater should rsync to ::srv module instead of :www [puppet] - 10https://gerrit.wikimedia.org/r/308008 (https://phabricator.wikimedia.org/T144278) [19:20:42] SELECT /* CategoryMembershipChangeJob::run 127.0.0.1 */ GET_LOCK('CategoryMembershipUpdates:XXXXX', 10) AS lockstatus [19:20:43] really [19:20:53] I am wondering why we use the database as a lock manager :/ [19:22:47] (03CR) 10Ottomata: [C: 032] Reportupdater should rsync to ::srv module instead of :www [puppet] - 10https://gerrit.wikimedia.org/r/308008 (https://phabricator.wikimedia.org/T144278) (owner: 10Ottomata) [19:24:57] 06Operations, 10Traffic: OpenSSL 1.1 deployment for cache clusters - https://phabricator.wikimedia.org/T144523#2602609 (10BBlack) [19:25:29] 06Operations, 10DBA, 06Performance-Team, 10Traffic, and 2 others: Apache <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809#2602624 (10aaron) [19:25:32] 06Operations, 06Performance-Team, 07Availability, 05MW-1.28-release-notes, and 2 others: Audit mysql database class and hhvm binding support of SSL - https://phabricator.wikimedia.org/T136218#2602623 (10aaron) 05Open>03Resolved [19:36:34] (03PS1) 10BBlack: Revert "package::builder: EXPERIMENTAL=yes support" [puppet] - 10https://gerrit.wikimedia.org/r/308010 [19:36:44] (03PS2) 10BBlack: Revert "package::builder: EXPERIMENTAL=yes support" [puppet] - 10https://gerrit.wikimedia.org/r/308010 [19:36:57] (03CR) 10BBlack: [C: 032 V: 032] Revert "package::builder: 
EXPERIMENTAL=yes support" [puppet] - 10https://gerrit.wikimedia.org/r/308010 (owner: 10BBlack) [19:39:09] !log T143226: Clearing repair status restbase1011-c.eqiad.wmnet [19:39:10] T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226 [19:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:39:45] hashar: global distributed lock ? :) [19:44:03] !log T143226: Clearing repair status: eqiad, rack 'b' nodes [19:44:04] T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226 [19:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:45:10] hashar: lemme know how this deploy goes, i'm watching stuff [19:54:04] !log reloading ferm rules on elasticsearch eqiad cluster [19:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:58:20] !log 1.28.0-wmf.17 rolled on group2 and apparently all fine [19:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:58:32] ottomata: I have pushed it roughly 40 minutes ago and it looks all fine [20:01:06] nice hashar cool [20:01:12] ya looks find from here so far too [20:01:16] Pchelolo: ^ [20:01:18] fine* [20:01:57] ottomata: I am still around a bit watching it [20:02:02] else poke #wikimedia-releng :] [20:02:22] ok great thanks [20:08:47] (03PS1) 10ArielGlenn: add timeout and related callback to method for running proc without output [dumps] - 10https://gerrit.wikimedia.org/r/308015 [20:08:49] (03PS1) 10ArielGlenn: fix up locking for misc dumps [dumps] - 10https://gerrit.wikimedia.org/r/308016 [20:09:02] (03CR) 10jenkins-bot: [V: 04-1] add timeout and related callback to method for running proc without output [dumps] - 10https://gerrit.wikimedia.org/r/308015 (owner: 10ArielGlenn) [20:09:09] (03CR) 10jenkins-bot: [V: 04-1] fix up locking for misc dumps [dumps] - 10https://gerrit.wikimedia.org/r/308016 
(owner: 10ArielGlenn) [20:09:13] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 03Discovery-Wikidata-Query-Service-Sprint: Move data storage to /srv/wdqs/ on codfw WDQS nodes - https://phabricator.wikimedia.org/T144536#2602875 (10Smalyshev) [20:10:44] of course. I thought I pep8 that thing [20:11:12] oh both are whining. meeeehh [20:12:04] (03PS1) 10Cmjohnson: Adding mkroetzsch to analytics private data users group [puppet] - 10https://gerrit.wikimedia.org/r/308017 [20:13:02] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 03Discovery-Wikidata-Query-Service-Sprint: Make puppet generate path config for WDQS nodes - https://phabricator.wikimedia.org/T144537#2602893 (10Smalyshev) [20:13:13] (03PS2) 10ArielGlenn: add timeout and related callback to method for running proc without output [dumps] - 10https://gerrit.wikimedia.org/r/308015 [20:13:43] (03CR) 10jenkins-bot: [V: 04-1] Adding mkroetzsch to analytics private data users group [puppet] - 10https://gerrit.wikimedia.org/r/308017 (owner: 10Cmjohnson) [20:13:51] (03PS2) 10Cmjohnson: Adding mkroetzsch and akrausetud to analytics private data users group [puppet] - 10https://gerrit.wikimedia.org/r/308017 [20:14:18] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Move data storage to /srv/wdqs/ on codfw WDQS nodes - https://phabricator.wikimedia.org/T144536#2602914 (10Smalyshev) a:03Gehel [20:14:41] (03PS2) 10ArielGlenn: fix up locking for misc dumps [dumps] - 10https://gerrit.wikimedia.org/r/308016 [20:15:10] (03PS3) 10Cmjohnson: Adding mkroetzsch and akrausetud to analytics private data users group [puppet] - 10https://gerrit.wikimedia.org/r/308017 [20:15:50] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 03Discovery-Wikidata-Query-Service-Sprint: Remove /srv/deployment/wdqs/wdqs/rules.log symlink - https://phabricator.wikimedia.org/T144539#2602931 (10Smalyshev) [20:16:13] 06Operations, 06Discovery, 10Wikidata, 
10Wikidata-Query-Service, 03Discovery-Wikidata-Query-Service-Sprint: Remove /srv/deployment/wdqs/wdqs/rules.log symlink - https://phabricator.wikimedia.org/T144539#2602931 (10Smalyshev) Low priority - we can live with the link for now, it's just not clean :) [20:17:11] (03PS1) 10Legoktm: Add basic debug logging functionality [software/service-checker] - 10https://gerrit.wikimedia.org/r/308019 [20:17:13] (03PS1) 10Legoktm: Add x-default-query functionality [software/service-checker] - 10https://gerrit.wikimedia.org/r/308020 [20:17:40] (03CR) 10Cmjohnson: [C: 032] Adding mkroetzsch and akrausetud to analytics private data users group [puppet] - 10https://gerrit.wikimedia.org/r/308017 (owner: 10Cmjohnson) [20:17:59] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: Remove /srv/deployment/wdqs/wdqs/rules.log symlink - https://phabricator.wikimedia.org/T144539#2602963 (10Smalyshev) [20:19:51] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 10Research-collaborations, 10Research-management: Request access to data for WDQS research - https://phabricator.wikimedia.org/T142780#2603004 (10Cmjohnson) Patchset has been merged for both users. You should have access at this time. It may take... 
[20:34:31] (03PS1) 10Mobrovac: service::node: Compile the file holding puppet-controlled vars [puppet] - 10https://gerrit.wikimedia.org/r/308021 (https://phabricator.wikimedia.org/T144542) [20:34:58] 06Operations, 10DBA, 06Labs, 07Tracking: Database replication services (tracking) - https://phabricator.wikimedia.org/T50930#2603094 (10Neil_P._Quinn_WMF) [20:35:34] !log T143226: Clearing repair status: eqiad, rack 'dd' nodes [20:35:35] T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226 [20:35:38] !log T143226: Clearing repair status: eqiad, rack 'd' nodes [20:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:35:39] T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226 [20:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:39:50] (03CR) 10Anomie: Do not use $wgExtensionFunctions to set globals (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307893 (https://phabricator.wikimedia.org/T143055) (owner: 10Gergő Tisza) [20:40:16] (03PS2) 10Mobrovac: service::node: Compile the file holding puppet-controlled vars [puppet] - 10https://gerrit.wikimedia.org/r/308021 (https://phabricator.wikimedia.org/T144542) [20:42:37] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 10Research-collaborations, 10Research-management: Request access to data for WDQS research - https://phabricator.wikimedia.org/T142780#2603127 (10Cmjohnson) 05Open>03Resolved [20:47:42] (03PS1) 10Gehel: wdqs - move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/308023 (https://phabricator.wikimedia.org/T144536) [20:47:48] (03PS3) 10Gergő Tisza: Do not use $wgExtensionFunctions to set globals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307893 (https://phabricator.wikimedia.org/T143055) [20:48:58] (03PS4) 10Gergő Tisza: Do not use $wgExtensionFunctions to set globals [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/307893 (https://phabricator.wikimedia.org/T143055) [20:49:58] (03CR) 10jenkins-bot: [V: 04-1] wdqs - move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/308023 (https://phabricator.wikimedia.org/T144536) (owner: 10Gehel) [20:50:18] (03CR) 10Gergő Tisza: Do not use $wgExtensionFunctions to set globals (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307893 (https://phabricator.wikimedia.org/T143055) (owner: 10Gergő Tisza) [20:54:23] (03PS2) 10Gehel: wdqs - move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/308023 (https://phabricator.wikimedia.org/T144536) [20:54:38] 06Operations, 06Labs, 13Patch-For-Review: Phase out the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#2603156 (10yuvipanda) wikidata-query done. [20:55:51] (03CR) 10Gehel: "Puppet compiler: https://puppet-compiler.wmflabs.org/3914/" [puppet] - 10https://gerrit.wikimedia.org/r/308023 (https://phabricator.wikimedia.org/T144536) (owner: 10Gehel) [20:58:41] (03CR) 10Smalyshev: "modules/wdqs/templates/initscripts/wdqs-updater.systemd.erb and modules/wdqs/templates/initscripts/wdqs-updater.upstart.erb need to be upd" [puppet] - 10https://gerrit.wikimedia.org/r/308023 (https://phabricator.wikimedia.org/T144536) (owner: 10Gehel) [20:58:55] !log mw2187 - shut down [20:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:59:01] (03CR) 10Smalyshev: [C: 04-1] wdqs - move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/308023 (https://phabricator.wikimedia.org/T144536) (owner: 10Gehel) [21:01:19] (03PS1) 10Dzahn: Revert "remove mw2167 thru mw2199 from conftool-data" [puppet] - 10https://gerrit.wikimedia.org/r/308024 [21:03:26] (03PS3) 10Rush: labstore: nfs-exports-daemon on all servers in a cluster [puppet] - 10https://gerrit.wikimedia.org/r/307629 (https://phabricator.wikimedia.org/T126083) [21:04:06] (03PS4) 
10Rush: labstore: run nfs-exports-daemon on all servers in a cluster [puppet] - 10https://gerrit.wikimedia.org/r/307629 (https://phabricator.wikimedia.org/T126083) [21:06:50] (03CR) 10Madhuvishy: [C: 031] labstore: run nfs-exports-daemon on all servers in a cluster [puppet] - 10https://gerrit.wikimedia.org/r/307629 (https://phabricator.wikimedia.org/T126083) (owner: 10Rush) [21:11:06] (03PS2) 10Dzahn: Revert "remove mw2167 thru mw2199 from conftool-data" [puppet] - 10https://gerrit.wikimedia.org/r/308024 [21:11:11] (03CR) 10Dzahn: [C: 032] Revert "remove mw2167 thru mw2199 from conftool-data" [puppet] - 10https://gerrit.wikimedia.org/r/308024 (owner: 10Dzahn) [21:16:43] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 714 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5107082 keys - replication_delay is 714 [21:18:37] (03PS10) 1020after4: WIP: Scap swat command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) [21:21:26] !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase1007-a [21:21:27] T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226 [21:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:21:58] !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase1010-a [21:21:59] T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226 [21:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:22:20] (03PS11) 1020after4: WIP: Scap swat command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) [21:22:23] !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase1011-a [21:22:24] T143226: Cluster-wide major compactions: parsoid-html - 
https://phabricator.wikimedia.org/T143226 [21:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:22:36] (03CR) 10MaxSem: WIP: Scap swat command (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) (owner: 1020after4) [21:24:55] !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase1008-a [21:24:56] T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226 [21:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:25:26] !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase1012-a [21:25:27] T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226 [21:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:25:39] (03PS12) 1020after4: WIP: Scap swat command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) [21:25:40] !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase1013-a [21:25:41] T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226 [21:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:26:05] (03CR) 1020after4: WIP: Scap swat command (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) (owner: 1020after4) [21:28:54] (03CR) 10BBlack: [C: 032] openssl (1.1.0-1+wmf2) experimental; urgency=medium [debs/openssl] (wmf-1.1) - 10https://gerrit.wikimedia.org/r/307983 (owner: 10BBlack) [21:39:27] (03PS2) 10Hashar: Use task to run modules spec [puppet] - 10https://gerrit.wikimedia.org/r/307223 [21:39:54] (03CR) 10Mobrovac: "PS2 PCC looking good - https://puppet-compiler.wmflabs.org/3913/" 
[puppet] - 10https://gerrit.wikimedia.org/r/308021 (https://phabricator.wikimedia.org/T144542) (owner: 10Mobrovac) [21:41:02] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5062622 keys - replication_delay is 0 [21:41:13] (03CR) 10Hashar: "Bah 'rake' has somehow disappeared :(" [puppet] - 10https://gerrit.wikimedia.org/r/307223 (owner: 10Hashar) [21:41:44] ACKNOWLEDGEMENT - Host wtp2016 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T144260 [21:42:15] 06Operations, 10ops-codfw: Broken disk on wtp2016 - https://phabricator.wikimedia.org/T144260#2603297 (10Dzahn) 05Resolved>03Open re-opening because the host is marked as DOWN in Icinga [22:00:04] MaxSem: Respected human, time to deploy Kartogrpaher deployment to more wikis (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160901T2200). Please do the needful. [22:00:10] (03PS5) 10Rush: labstore: run nfs-exports-daemon on all servers in a cluster [puppet] - 10https://gerrit.wikimedia.org/r/307629 (https://phabricator.wikimedia.org/T126083) [22:00:10] jouncebot, next [22:00:10] In 0 hour(s) and 59 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160901T2300) [22:00:20] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 10Research-collaborations, 10Research-management: Request access to data for WDQS research - https://phabricator.wikimedia.org/T142780#2603387 (10leila) thanks a lot, @Cmjohnson. [22:00:25] lies. [22:00:34] my window [22:00:47] (03PS1) 10Ppchelko: Change-Prop: Switch to new events. 
[puppet] - 10https://gerrit.wikimedia.org/r/308077 [22:00:59] (03PS2) 10MaxSem: Deploy Kartographer everywhere public [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307981 (https://phabricator.wikimedia.org/T144062) [22:01:16] MaxSem: "next" at 3:00 is SWAT, it doesn't have "now" [22:01:18] jouncebot: now [22:01:35] :P [22:01:44] (03CR) 10MaxSem: [C: 032] Deploy Kartographer everywhere public [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307981 (https://phabricator.wikimedia.org/T144062) (owner: 10MaxSem) [22:02:11] (03Merged) 10jenkins-bot: Deploy Kartographer everywhere public [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307981 (https://phabricator.wikimedia.org/T144062) (owner: 10MaxSem) [22:03:13] jgirault, ready? :) [22:03:40] pulled on mw1099 [22:03:46] !! [22:05:25] ooops [22:05:38] this enables mapframe too [22:06:07] yeah :/ [22:06:21] * MaxSem scratches head [22:08:41] > echo json_encode($wgKartographerEnableTags); [22:08:41] ["mapframe","maplink","maplink"] [22:09:55] legoktm, is there a way to override the value of an array from extension.json, as opposed to merging? [22:11:30] 06Operations, 10ops-eqiad: decom ytterbium (datacenter) - https://phabricator.wikimedia.org/T141415#2497714 (10demon) I think a month is plenty of grace period, let's kill this thing. [22:13:34] (03PS2) 10Ppchelko: Change-Prop: Switch to new events. [puppet] - 10https://gerrit.wikimedia.org/r/308077 [22:25:09] MaxSem: foo => true, bar => false [22:25:28] legoktm, https://gerrit.wikimedia.org/r/308082 [22:25:53] I'm pretty sure I tested it though [22:25:59] :confused: [22:26:14] that also works [22:26:34] can you +2 it then? :p [22:29:15] (03PS1) 10MaxSem: $wgKartographerEnableTags --> $wgKartographerEnableMapFrame [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308084 [22:29:51] MaxSem I can +2 it [22:29:53] looks good [22:30:09] already reviewed [22:35:31] dear holy Zuul... 
[22:36:06] (03PS1) 10BryanDavis: Add a "now" command [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/308086 [22:36:08] (03CR) 10MaxSem: [C: 032] $wgKartographerEnableTags --> $wgKartographerEnableMapFrame [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308084 (owner: 10MaxSem) [22:36:25] (03PS1) 10BryanDavis: Use normal messages rather than notices for help [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/308087 [22:36:35] (03Merged) 10jenkins-bot: $wgKartographerEnableTags --> $wgKartographerEnableMapFrame [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308084 (owner: 10MaxSem) [22:36:47] (03CR) 10Dzahn: "yes, desired feature indeed :)" [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/308086 (owner: 10BryanDavis) [22:53:26] (03PS1) 10MaxSem: Revert "Deploy Kartographer everywhere public" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308088 (https://phabricator.wikimedia.org/T144062) [22:53:30] (03CR) 10MaxSem: [C: 032] Revert "Deploy Kartographer everywhere public" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308088 (https://phabricator.wikimedia.org/T144062) (owner: 10MaxSem) [22:55:21] PROBLEM - puppet last run on mw2202 is CRITICAL: CRITICAL: puppet fail [22:56:07] (03PS2) 10MaxSem: Revert "Deploy Kartographer everywhere public" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308088 (https://phabricator.wikimedia.org/T144062) [22:56:20] PROBLEM - puppet last run on ms-be1022 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [22:56:27] (03CR) 10MaxSem: Revert "Deploy Kartographer everywhere public" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308088 (https://phabricator.wikimedia.org/T144062) (owner: 10MaxSem) [22:56:31] (03CR) 10MaxSem: [C: 032] Revert "Deploy Kartographer everywhere public" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308088 (https://phabricator.wikimedia.org/T144062) (owner: 10MaxSem) [22:57:00] (03Merged) 10jenkins-bot: Revert "Deploy Kartographer 
everywhere public" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308088 (https://phabricator.wikimedia.org/T144062) (owner: 10MaxSem) [23:00:04] RoanKattouw, ostriches, MaxSem, awight, and Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160901T2300). [23:00:04] ebernhardson, Dereckson, and RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:18] 1 sec, I'm wrapping up [23:00:52] \o [23:00:59] where did i go? [23:01:19] jdlrobson: You're not listed at https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160901T2300 [23:01:21] !log maxsem@tin Synchronized php-1.28.0-wmf.17/extensions/Kartographer: https://gerrit.wikimedia.org/r/#/c/308085/ (duration: 02m 54s) [23:01:22] feck i put it in the wrong box [23:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:01:28] Rrrggg i hate editing this thing [23:01:32] Crap I need to add more changes [23:01:51] 06Operations, 10Wikimedia-Mailing-lists: reset password of winedale-l - https://phabricator.wikimedia.org/T144416#2603647 (10Dzahn) 05Open>03Resolved a:03KFrancis [23:01:53] irish [23:01:58] moved it [23:02:19] RoanKattouw: >https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=819253&oldid=819235 [23:03:37] @seen Volker_E [23:03:37] mutante: I have never seen Volker_E [23:03:42] Got it [23:03:48] wm-bot: he's right here [23:03:55] Volker_E: hi? [23:03:58] I can't do the SWAT today BTW because OIT is trying to rescue my laptop from the water I spilled on it [23:04:07] And my SSH keys are on the hard drive that they're trying to recover [23:04:14] LOL [23:04:14] I'm using a loaner right now [23:05:38] jdlrobson is that irish? 
:16 [23:06:02] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 02m 53s) [23:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:06:17] irish? you've lost me. [23:06:36] jdlrobson feck i put it in the wrong box [23:06:37] !log That for https://gerrit.wikimedia.org/r/#/c/308084/ [23:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:06:45] I'm done, btw [23:06:49] I hear that from only irish people like mrs brown boys [23:07:42] MaxSem: RoanKattouw: one of you SWAT? [23:07:48] 23:03:58 < RoanKattouw> I can't do the SWAT today [23:07:53] I've the answer [23:08:06] Dereckson: Sorry, I spilled water on my laptop so I don't have my SSH keys today [23:08:08] Okay, so I can SWAT. [23:08:11] Thanks [23:10:01] jgirault: ping [23:10:35] jgirault: if you're there, do you want to reschedule now https://gerrit.wikimedia.org/r/#/c/307965/ ? [23:12:42] ebernhardson: https://gerrit.wikimedia.org/r/#/c/306552/ is independent of the two other changes, isn't it? [23:12:53] Dereckson: right [23:14:00] Dereckson: thank you. Note one of mine is a labs change. [23:14:42] (03PS2) 10Dereckson: Disable phrase suggester for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306552 (https://phabricator.wikimedia.org/T143260) (owner: 10EBernhardson) [23:16:51] jdlrobson: please check tests at https://gerrit.wikimedia.org/r/#/c/307971 and https://gerrit.wikimedia.org/r/#/c/307970 [23:17:00] Currently V -1 by Jenkins [23:17:11] (I've done a recheck on both) [23:17:15] Dereckson: those are session manager issues Fatal error: Uncaught exception 'DBQueryError' with message ' in /mnt/jenkins-workspace/workspace/mwext-mw-selenium/src/includes/db/Database.php on line [23:17:17] :S [23:17:18] so should be unrelated [23:18:00] Dereckson: shoot. 
https://gerrit.wikimedia.org/r/#/c/304306/ is probably needed [23:19:04] zero patch will be okay with https://gerrit.wikimedia.org/r/308094 [23:20:14] jdlrobson: "8 patches max" is a limit by window, not by requester [23:20:30] lol :) [23:21:01] sure. I'm not sure what to do in this situation though. The Zero and lazy images bugs are pretty critical [23:21:09] images are not loading for some people and Zero users cannot edit [23:21:17] (03CR) 10Dereckson: [C: 032] Disable phrase suggester for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306552 (https://phabricator.wikimedia.org/T143260) (owner: 10EBernhardson) [23:21:26] restore taglines on labs i can do outside window so you can skip that one [23:21:38] and "End lazy loading reference experiments" is a nice to have [23:21:44] (03Merged) 10jenkins-bot: Disable phrase suggester for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306552 (https://phabricator.wikimedia.org/T143260) (owner: 10EBernhardson) [23:21:57] it will just be annoying to wait another week [23:22:24] jdlrobson: add them to the calendar [23:22:30] Dereckson: they are all there [23:22:48] but https://gerrit.wikimedia.org/r/#/c/304306/ may be needed to get around that jenkins issue (although it should be fine to force merge) [23:23:07] RoanKattouw: an opinion about that ? ^ [23:23:32] RECOVERY - puppet last run on mw2202 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:23:32] jdlrobson: by the way 307967 is independent of all the remaining patches?
[23:23:40] (03PS2) 10Dereckson: Enable Wikidata descriptions taglines on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307967 (https://phabricator.wikimedia.org/T143345) (owner: 10Jdlrobson) [23:23:43] they are all independent [23:23:45] If there's a known CI issue that's fixed in master, force-merging in the wmf branch is OK IMO [23:26:44] (03PS1) 10Paladox: Revert change that fixed the diff being cutoff [puppet] - 10https://gerrit.wikimedia.org/r/308095 [23:27:07] (03PS2) 10Paladox: Revert change that fixed the diff being cutoff [puppet] - 10https://gerrit.wikimedia.org/r/308095 [23:27:21] jdlrobson: for 307967 labs, I wonder if the ideal is only en (like you did) or for all the nowikidatadescriptiontaglines dblist? [23:27:36] (03CR) 10Catrope: [C: 031] "This change caused bugs with contextual commenting" [puppet] - 10https://gerrit.wikimedia.org/r/308095 (owner: 10Paladox) [23:27:46] Dereckson: For our purposes it should be enough - its the only one we test on. [23:27:50] * Dereckson nods. [23:28:09] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307967 (https://phabricator.wikimedia.org/T143345) (owner: 10Jdlrobson) [23:28:35] (03Merged) 10jenkins-bot: Enable Wikidata descriptions taglines on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307967 (https://phabricator.wikimedia.org/T143345) (owner: 10Jdlrobson) [23:30:48] mutante: could you do again the kludge you did for the train? [23:31:14] mutante: tin is still blocked by down proxy [23:31:59] We've 10 patches to sync with a 3 minutes timeout to wait for scap. Yeah. 
[23:32:30] !log dereckson@tin Synchronized wmf-config/InitialiseSettings-labs.php: Enable Wikidata descriptions taglines on labs (T143345, no-op in prod) (duration: 02m 52s) [23:32:31] T143345: Deploy Wikidata descriptions to mobile web stable channel Wikipedias 2nd half - https://phabricator.wikimedia.org/T143345 [23:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:34:40] just take the down proxy out of the list of proxies? [23:35:48] its the scap-proxies dsh group that it needs to come out of [23:36:10] bd808: You "just" need root powers for that [23:36:10] Dereckson: yes, i can [23:36:22] thanks [23:36:22] well, the dsh part [23:36:52] RoanKattouw: *nod* mostly for mutante's benefit rather than starting the host up again [23:37:05] well, i was told to shut it down again [23:37:26] if you mean the part where i powered up the server [23:37:26] yeah, which seems fine. we just need to make scap stop talking to it [23:38:19] !log tin stopping puppet [23:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:39:18] !log tin removed mw2187 from /etc/dsh/group/scap-proxies [23:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:40:01] jdlrobson: okay I've manually merged them [23:40:17] thanks mutante [23:40:46] i was looking for the other appservers too [23:40:52] that we removed earlier [23:40:57] but looks like they are gone now [23:41:04] try it [23:41:23] ebernhardson: why do you want 307954 on wmf/1.28.0-wmf.16 by the way? 
[23:41:36] Dereckson: because i tried to ship it this morning and this morning was canceled :P [23:41:42] we're 1.28.0-wmf.17 only [23:41:44] ok [23:41:45] Dereckson: can revert and not bother with .16 [23:42:10] ebernhardson: yes, please do a revert change for .16 [23:45:10] ebernhardson: 307955 Do not use the suggest reverse field if it's a non local search live on mw1099 [23:46:05] RoanKattouw: your patches are live on mw1099 [23:46:09] Dereckson: ok sec, seeing if it fixes the broken query [23:46:48] Dereckson: Thanks. I just realized I'll need to install the plugin to connect to mw1099 :/ [23:46:58] (Because I'm on a temporary laptop) [23:47:09] it's a no reboot plugin, that will be quick [23:47:12] Yeah [23:47:39] OK that took like no time at all [23:47:49] 26 seconds to Google it, find it, and install it [23:48:04] jdlrobson: MF and ZeroBanner patches live on mw1099 [23:48:09] Dereckson: works [23:48:10] Dereckson: on it [23:49:02] ebernhardson: ack'ed [23:50:01] Dereckson: OK, my patches are working.
There's one aspect I can't test because it's specific to IE, and there's no X-WM-Debug plugin for IE I think [23:50:19] But I'll test that once it's fully deployed [23:51:03] ok [23:51:09] !log dereckson@tin Synchronized php-1.28.0-wmf.17/extensions/CirrusSearch/includes/Query/FullTextQueryStringQueryBuilder.php: Do not use the suggest reverse field if it's a non local search ([[Gerrit:307955]]) (duration: 00m 48s) [23:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:51:52] RoanKattouw: Duplicate get(): "enwiki:echo:seen:alert:time:15999850" fetched 3 times [23:51:54] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2167 is CRITICAL: Host mw2167 is not in mediawiki-installation dsh group daniel_zahn in scheduled downtime [23:51:54] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2168 is CRITICAL: Host mw2168 is not in mediawiki-installation dsh group daniel_zahn in scheduled downtime [23:51:54] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2169 is CRITICAL: Host mw2169 is not in mediawiki-installation dsh group daniel_zahn in scheduled downtime [23:51:54] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2170 is CRITICAL: Host mw2170 is not in mediawiki-installation dsh group daniel_zahn in scheduled downtime [23:51:54] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2172 is CRITICAL: Host mw2172 is not in mediawiki-installation dsh group daniel_zahn in scheduled downtime [23:51:54] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2173 is CRITICAL: Host mw2173 is not in mediawiki-installation dsh group daniel_zahn in scheduled downtime [23:51:54] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2174 is CRITICAL: Host mw2174 is not in mediawiki-installation dsh group daniel_zahn in scheduled downtime [23:51:55] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2175 is CRITICAL: Host mw2175 is not in mediawiki-installation dsh group daniel_zahn in scheduled 
downtime [23:52:21] Dereckson: That should be unrelated, none of my patches are in Echo [23:52:23] (I see that also when someone tests a non Echo patch on mw1099) [23:52:32] But we should fix that some time [23:52:36] I think there's a task about it somewhere [23:53:18] lazy image one is fixed Dereckson [23:53:26] looking into the zero one [23:53:49] Dereckson: https://phabricator.wikimedia.org/T144534 [23:53:50] Dereckson: turns out zero is near impossible to test [23:57:33] !log dereckson@tin Synchronized php-1.28.0-wmf.17/extensions/VisualEditor/lib/ve: Update lib/ve submodule for Ib9bbaccfff9 (duration: 00m 47s) [23:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master